Maria A. Bravo

Localized Vision-Language Matching for Open-vocabulary Object Detection

May 12, 2022

Maria A. Bravo, Sudhanshu Mittal, Thomas Brox

Abstract: In this work, we propose an open-world object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach: the first stage uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner, and the second stage specializes the model for the object detection task using known-class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-world detection approaches while being data-efficient.

MAIN: Multi-Attention Instance Network for Video Segmentation

Apr 11, 2019

Juan Leon Alcazar, Maria A. Bravo, Ali K. Thabet, Guillaume Jeanneret, Thomas Brox, Pablo Arbelaez, Bernard Ghanem

Abstract: Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overcomes challenging segmentation scenarios over arbitrary videos without modelling sequence- or instance-specific knowledge. We design MAIN to segment multiple instances in a single forward pass, and optimize it with a novel loss function that favors class-agnostic predictions and assigns instance-specific penalties. We achieve state-of-the-art performance on the challenging YouTube-VOS dataset and benchmark, improving the unseen Jaccard and F-Metric by 6.8% and 12.7% respectively, while operating in real time (30.3 FPS).
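
For readers unfamiliar with the Jaccard measure reported above: it is the standard intersection-over-union between a predicted mask and a ground-truth mask. The following is a minimal sketch of that definition, assuming binary masks stored as nested lists; the function name `jaccard` and the empty-mask convention are illustrative assumptions, not the benchmark's official evaluation code.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks.

    Both masks are equally shaped nested lists of 0/1 values
    (an assumption made for this illustration).
    """
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1  # pixel is foreground in both masks
            if p or g:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
score = jaccard(pred, gt)
```

In video object segmentation this region score is typically averaged over frames and instances; that averaging is omitted here.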
