Thanos Delatolas

Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

Apr 07, 2025

Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos

Abstract: This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.

* Accepted to CVPRW 2025
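
The abstract above refers to the affinity between frame features but does not spell out how it is used. As a rough illustration only, the sketch below shows one common way affinities between per-pixel features of two frames can propagate a mask from a reference frame to a target frame. The function name `propagate_labels`, the `top_k` and `temperature` parameters, and the use of cosine similarity are assumptions made for this example and are not taken from the paper; the features are plain arrays standing in for whatever a diffusion model would produce.

```python
import numpy as np


def propagate_labels(feat_ref, labels_ref, feat_tgt, top_k=5, temperature=0.1):
    """Propagate soft labels from a reference frame to a target frame.

    Hypothetical helper for illustration (not the paper's implementation).
    feat_ref:   (N, C) features of N reference-frame locations
    labels_ref: (N, K) one-hot (or soft) labels of those locations
    feat_tgt:   (M, C) features of M target-frame locations
    Returns an (M, K) array of soft labels whose rows sum to 1.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)

    # Affinity matrix: similarity of every target location to every reference location.
    affinity = tgt @ ref.T  # shape (M, N)

    # Keep only the top_k most similar reference locations per target location.
    idx = np.argpartition(-affinity, top_k - 1, axis=1)[:, :top_k]  # (M, top_k)
    top = np.take_along_axis(affinity, idx, axis=1)                  # (M, top_k)

    # Softmax over the kept affinities (shifted by the row max for stability).
    weights = np.exp((top - top.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the labels of the selected reference locations.
    return np.einsum("mk,mkc->mc", weights, labels_ref[idx])


if __name__ == "__main__":
    # Toy data: two reference locations per object, one target location per object.
    feat_ref = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    labels_ref = np.eye(2)[[0, 0, 1, 1]]
    feat_tgt = np.array([[1.0, 0.05], [0.05, 1.0]])
    soft = propagate_labels(feat_ref, labels_ref, feat_tgt, top_k=2)
    print(soft.argmax(axis=1))
```

In this sketch the quality of the propagated mask depends entirely on how well feature similarity tracks true point correspondences, which is consistent with the correlation the abstract reports.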

Learning the What and How of Annotation in Video Object Segmentation

Nov 11, 2023

Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos

Abstract: Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de facto way of annotating objects requires humans to draw detailed segmentation masks on the target objects in every video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that iteratively predicts both which frame ("What") to annotate and which annotation type ("How") to use. The annotator then annotates only the selected frame, which is used to update a VOS module, leading to significant gains in annotation time. We conduct experiments on the MOSE and DAVIS datasets and show that: (a) EVA-VOS produces masks with accuracy close to human agreement 3.5x faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; (c) EVA-VOS yields significant gains in annotation time compared to all other methods and baselines.

* Accepted to WACV 2024
