Title: Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

Mar 21, 2024

Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer

Abstract: Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
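
The pair-selection idea in the abstract can be illustrated with a small sketch. The abstract does not give the scoring function, so everything concrete here is an assumption made for illustration: the function name `select_top_pairs`, the projection matrices `w_subj` and `w_obj`, dot-product scoring, and top-k selection are hypothetical and are not taken from the paper. The sketch only shows one plausible way to score every ordered pair of object tokens and keep the highest-scoring pairs.

```python
import numpy as np


def select_top_pairs(tokens, w_subj, w_obj, k):
    """Score every ordered (subject, object) token pair and keep the k best.

    Assumed shapes (hypothetical, for illustration only):
      tokens: (n, d_model) object tokens from an image encoder
      w_subj, w_obj: (d_model, d) projections for the two roles
    """
    subj = tokens @ w_subj              # (n, d) subject-role embeddings
    obj = tokens @ w_obj                # (n, d) object-role embeddings
    scores = subj @ obj.T               # (n, n) score for each ordered pair
    np.fill_diagonal(scores, -np.inf)   # this sketch ignores self-pairs
    # Sort all pair scores in descending order and keep the first k.
    flat = np.argsort(scores, axis=None)[::-1][:k]
    subj_idx, obj_idx = np.unravel_index(flat, scores.shape)
    pairs = list(zip(subj_idx.tolist(), obj_idx.tolist()))
    return pairs, scores


# Toy usage with random tokens and random (untrained) projections.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_subj = rng.normal(size=(8, 4))
w_obj = rng.normal(size=(8, 4))
pairs, scores = select_top_pairs(tokens, w_subj, w_obj, k=3)
```

Using separate subject and object projections makes the score asymmetric, so the pair (a, b) can rank differently from (b, a), which matters for directed relationships such as "person riding horse". In a trained model the projections would be learned; here they are random and serve only to make the example run.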
