RAP-SAM: Towards Real-Time All-Purpose Segment Anything

1Peking University, 2Nanyang Technological University, 3UC Merced, 4Shanghai AI Laboratory, 5KAUST, 6Google Research
*Equal contribution. Project Leader


We present real-time all-purpose segmentation, which segments and recognizes objects for image, video, and interactive inputs. In addition to benchmarking, we propose a simple yet effective baseline, named RAP-SAM, which achieves the best accuracy-speed trade-off across the three tasks. Real-time panoptic segmentation and video instance segmentation results are shown on the right.

Comparison of segmentation methods. Our proposed RAP-SAM supports multiple segmentation tasks in one model and runs in real time. The comparison covers semantic segmentation (SS), panoptic segmentation (PS), video instance segmentation (VIS), interactive segmentation, multi-task support in one model, and real-time inference, for ICNet, Bi-Seg, YOSO, Mobile-VIS, SAM, Mask2Former, Video K-Net, OneFormer, and RAP-SAM (ours).

Abstract

Advanced by the transformer architecture, vision foundation models (VFMs) have achieved remarkable progress in performance and generalization ability. The Segment Anything Model (SAM) is one remarkable model that achieves generalized segmentation. However, most VFMs cannot run in real time, which makes it difficult to transfer them into many products. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real time, to transfer VFMs to real-time deployment. It contains three different tasks: interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve all these tasks in real time. We first benchmark several strong baselines. Then, we present Real-Time All-Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder that performs prompt-driven decoding. Moreover, we further explore different training strategies and tuning methods to boost co-training performance.
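As a rough sketch of how co-training one model across the three tasks could be organized (the task names, sampling ratios, and the task-conditioned model interface below are illustrative assumptions, not the paper's actual recipe):

    # Hedged sketch of multi-task co-training across the three tasks.
    # Dataset choices, sampling ratios, and loss weights are illustrative assumptions.
    import random
    import torch

    def co_train_step(model, optimizer, loaders, loss_fns, ratios=(0.4, 0.3, 0.3)):
        """One step: sample a task, fetch a batch, and update the shared model."""
        task = random.choices(["panoptic", "video", "interactive"], weights=ratios)[0]
        batch = next(loaders[task])                 # task-specific dataloader iterator
        preds = model(batch["inputs"], task=task)   # single shared model, task-conditioned
        loss = loss_fns[task](preds, batch["targets"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return task, loss.item()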

Video

Method



RAP-SAM overview. Our method accepts three visual inputs: image, video, and visual prompts. Using positional encoding, we generate prompt queries from these visual prompts. The learnable object queries, together with the prompt queries and the feature map F, are fed into the multi-stage decoder. This process generates multi-stage predictions and refined queries. These refined queries then perform cross-attention with F, producing the final prediction.
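The decoding pipeline described in the caption can be summarized with a minimal PyTorch-style sketch. Module names, query counts, feature dimensions, and the number of decoder stages below are illustrative assumptions rather than the exact RAP-SAM implementation:

    # Minimal sketch of the prompt-driven, multi-stage decoding described above.
    # All sizes and module choices are illustrative assumptions.
    import torch
    import torch.nn as nn

    class RapSamDecoderSketch(nn.Module):
        def __init__(self, dim=256, num_obj_queries=100, num_stages=3, num_classes=133):
            super().__init__()
            self.obj_queries = nn.Embedding(num_obj_queries, dim)   # learnable object queries
            self.stages = nn.ModuleList(
                [nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
                 for _ in range(num_stages)]
            )
            self.final_cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.cls_head = nn.Linear(dim, num_classes + 1)          # class logits per query
            self.mask_head = nn.Linear(dim, dim)                     # mask embedding per query

        def forward(self, feat, prompt_queries=None):
            # feat: (B, HW, C) flattened encoder feature map F
            B = feat.size(0)
            queries = self.obj_queries.weight.unsqueeze(0).expand(B, -1, -1)
            if prompt_queries is not None:                           # prompt queries from clicks/boxes
                queries = torch.cat([queries, prompt_queries], dim=1)

            stage_outputs = []
            for stage in self.stages:                                # multi-stage decoding
                queries = stage(tgt=queries, memory=feat)
                stage_outputs.append(self._predict(queries, feat))   # multi-stage predictions

            # refined queries attend to F once more for the final prediction
            refined, _ = self.final_cross_attn(queries, feat, feat)
            return stage_outputs, self._predict(refined, feat)

        def _predict(self, queries, feat):
            logits = self.cls_head(queries)                          # (B, N, num_classes + 1)
            masks = torch.einsum("bnc,bkc->bnk", self.mask_head(queries), feat)  # (B, N, HW)
            return logits, masks

    # Example: a 64x64 flattened feature map with one point-prompt query.
    feat = torch.randn(1, 64 * 64, 256)
    prompt = torch.randn(1, 1, 256)
    stage_preds, (final_logits, final_masks) = RapSamDecoderSketch()(feat, prompt)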


Visualization

1. Interactive Segmentation


Interactive segmentation results with a single-point prompt (shown in green).
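For intuition, below is a hedged sketch of how a single click might be converted into a prompt query with a standard sinusoidal positional encoding; the function name and the exact encoding scheme are assumptions, not necessarily what RAP-SAM uses:

    # Hedged sketch: encoding a single-point click as a prompt query via
    # sinusoidal positional encoding. The actual encoding in RAP-SAM may differ.
    import math
    import torch

    def point_to_prompt_query(x, y, img_w, img_h, dim=256):
        """Encode a click at pixel (x, y) as a `dim`-dimensional prompt query."""
        coords = torch.tensor([x / img_w, y / img_h])             # normalize to [0, 1]
        half = dim // 4                                           # dim/4 frequencies per axis, sin + cos
        freqs = torch.exp(torch.arange(half) * (-math.log(10000.0) / half))
        angles = coords[:, None] * freqs[None, :] * 2 * math.pi   # (2, dim/4)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)     # (2, dim/2)
        return enc.flatten()                                      # (dim,) prompt query

    # Example: a click at pixel (320, 240) in a 640x480 frame.
    query = point_to_prompt_query(320, 240, 640, 480)             # shape: torch.Size([256])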


2. Video Instance Segmentation


The visualization results on YouTube-VIS 2019.


3. COCO Panoptic Segmentation


The visualization results on the COCO dataset.

BibTeX

 @misc{Xu2023RAP-SAM,
        title={RAP-SAM: Towards Real-Time All-Purpose Segment Anything},
        author={Shilin Xu and Haobo Yuan and Qingyu Shi and Lu Qi and Jingbo Wang and Yibo Yang and Yining Li and Kai Chen and Yunhai Tong and Bernard Ghanem and Xiangtai Li and Ming-Hsuan Yang},
        journal={arXiv preprint},
        year={2023},
  }

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.