Advanced by transformer architectures, vision foundation models (VFMs) have achieved remarkable progress in performance and generalization. The Segment Anything Model (SAM) is one such model, capable of generalized segmentation. However, most VFMs cannot run in real time, which makes them difficult to deploy in practical products. This work therefore explores a new real-time segmentation setting, named all-purpose segmentation in real time, to transfer VFMs to real-time deployment. It covers three tasks: interactive segmentation, panoptic segmentation, and video segmentation. We aim to achieve all of them with a single model running in real time. We first benchmark several strong baselines. We then present Real-Time All-Purpose SAM (RAP-SAM), which combines an efficient encoder with an efficient decoupled decoder that performs prompt-driven decoding. Moreover, we explore different training strategies and tuning methods to further boost co-training performance.
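To make the setting concrete, below is a minimal sketch of how one model could serve all three tasks through a single interface. The `segment` function, the `prompts` keyword on `model`, and the tensor layouts are illustrative assumptions for this page, not the released RAP-SAM API.

```python
# Hypothetical dispatch over the three tasks of the all-purpose setting.
from typing import Optional
import torch


def segment(model: torch.nn.Module,
            frames: torch.Tensor,
            point_prompts: Optional[torch.Tensor] = None):
    """Run one shared model in the three settings (illustrative sketch).

    frames:        (T, 3, H, W) video clip; T == 1 means a single image.
    point_prompts: (P, 2) click coordinates, or None for prompt-free modes.
    """
    if point_prompts is not None:
        # Interactive segmentation: decode masks for the prompted objects.
        return model(frames[:1], prompts=point_prompts)
    if frames.size(0) == 1:
        # Panoptic segmentation on a single image.
        return model(frames)
    # Video segmentation: per-frame masks linked by the shared queries.
    return model(frames)
```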
RAP-SAM overview.
Our method takes three kinds of visual input: images, videos, and visual prompts. Using positional encoding, we generate prompt queries from the visual prompts. The learnable object queries, together with the prompt queries and the feature map F, are fed into the multi-stage decoder, which produces multi-stage predictions and refined queries. The refined queries then perform cross-attention with F to produce the final prediction.
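The decoding flow described above can be sketched roughly as follows in PyTorch. Module names, dimensions, and the use of standard multi-head attention are assumptions for illustration and do not reproduce the actual RAP-SAM decoder.

```python
# Minimal sketch of the prompt-driven decoding flow; not the authors' code.
import torch
import torch.nn as nn


class DecoderSketch(nn.Module):
    def __init__(self, embed_dim=256, num_queries=100, num_classes=133, num_stages=3):
        super().__init__()
        # Learnable object queries shared across tasks.
        self.object_queries = nn.Embedding(num_queries, embed_dim)
        # Maps positional encodings of visual prompts (points/boxes) to prompt queries.
        self.prompt_to_query = nn.Linear(embed_dim, embed_dim)
        # Multi-stage decoder: each stage refines the queries against the feature map F.
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
            for _ in range(num_stages)
        )
        # Final cross-attention between refined queries and F.
        self.final_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)

    def forward(self, feat, prompt_pe=None):
        # feat: (B, HW, C) flattened feature map F from the efficient encoder.
        B = feat.size(0)
        queries = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        if prompt_pe is not None:
            # prompt_pe: (B, P, C) positional encodings of the visual prompts.
            prompt_queries = self.prompt_to_query(prompt_pe)
            queries = torch.cat([queries, prompt_queries], dim=1)

        # Each stage cross-attends the queries to F; per-stage mask predictions
        # (omitted here) would be read out from the intermediate queries.
        for stage in self.stages:
            attn_out, _ = stage(queries, feat, feat)
            queries = queries + attn_out

        # Final cross-attention with F, then mask logits via dot product.
        refined, _ = self.final_attn(queries, feat, feat)
        mask_logits = torch.einsum("bqc,bnc->bqn", refined, feat)  # (B, Q, HW)
        cls_logits = self.cls_head(refined)
        return cls_logits, mask_logits
```

In this sketch, mask logits are read out as dot products between the refined queries and the feature map, a common design in query-based segmenters; the multi-stage predictions mentioned in the caption would be produced the same way from the intermediate queries.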
Interactive segmentation results with a single-point prompt (shown in green).
Visualization results on YouTube-VIS 2019.
Visualization results on the COCO dataset.
@misc{Xu2023RAP-SAM,
  title={RAP-SAM: Towards Real-Time All-Purpose Segment Anything},
  author={Shilin Xu and Haobo Yuan and Qingyu Shi and Lu Qi and Jingbo Wang and Yibo Yang and Yining Li and Kai Chen and Yunhai Tong and Bernard Ghanem and Xiangtai Li and Ming-Hsuan Yang},
  journal={arXiv preprint},
  year={2023},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.