Abstract: Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and thus lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,212 diverse categories, surpassing the category size of existing datasets by more than an order of magnitude. Third, we propose an efficient Memory-Induced Vision-Language Transformer, MindVLT, which is the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of MindVLT on novel categories. We will release the dataset and code to facilitate future endeavors.