Simon Ging

University of Freiburg

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Feb 11, 2024

Simon Ging, María A. Bravo, Thomas Brox

Abstract: The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.

* Accepted as a Spotlight Paper at ICLR 2024. The first two authors contributed equally to this work.
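
The follow-up idea in the abstract can be illustrated with a minimal sketch. It assumes a toy child-to-parent hierarchy, exact string matching, and a made-up question template; the names `PARENT`, `ancestors`, and `follow_up_question` are hypothetical and not taken from the paper, whose actual procedure may differ.

```python
from typing import List, Optional

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
}


def ancestors(label: str) -> List[str]:
    """Return the ancestors of a label, nearest first."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def follow_up_question(ground_truth: str, answer: str) -> Optional[str]:
    """Ask for more detail when the answer is a coarser ancestor of the label."""
    normalized = answer.strip().lower()
    if normalized == ground_truth:
        return None  # already fine-grained, nothing to ask
    if normalized in ancestors(ground_truth):
        # Assumed question template, for illustration only.
        return f"What type of {normalized} is this?"
    return None  # unrelated answer: no follow-up in this sketch
```

Under these assumptions, a coarse answer such as "Dog" for the label "golden retriever" triggers a follow-up, while an exact or unrelated answer does not.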

Open-vocabulary Attribute Detection

Nov 23, 2022

María A. Bravo, Sudhanshu Mittal, Simon Ging, Thomas Brox

Abstract: Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute detection performance of several foundation models. Project page: https://ovad-benchmark.github.io/
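
To illustrate why explicit positive and negative annotations matter for evaluation, here is a minimal sketch of average precision for a single attribute. It assumes a label encoding of 1 (positive), 0 (negative), and -1 (unannotated, skipped); this encoding and the function name `average_precision` are assumptions for illustration, not the benchmark's actual evaluation code.

```python
def average_precision(scores, labels):
    """AP over annotated entries only.

    Assumed encoding: 1 = positive, 0 = negative, -1 = unannotated (ignored).
    """
    # Keep only entries that carry an explicit positive or negative label.
    pairs = [(s, l) for s, l in zip(scores, labels) if l != -1]
    # Rank predictions from most to least confident.
    pairs.sort(key=lambda p: p[0], reverse=True)
    num_pos = sum(l for _, l in pairs)
    if num_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / num_pos
```

Skipping unannotated entries means a model is not penalized for predicting an attribute that annotators simply never labeled, which is one plausible reading of how negative annotations enable open-vocabulary evaluation.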

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Nov 01, 2020

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext

* 27 pages, 5 figures, 19 tables. To be published at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). The first two authors contributed equally to this work.
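
The cross-modal cycle-consistency idea mentioned in the abstract can be sketched in plain Python. This is a simplified illustration under assumptions: embeddings are lists of floats, soft nearest neighbors use a softmax over negative squared distances, and the loss is the squared gap between a sentence's index and its soft cycled-back index. The function names are hypothetical, and the released COOT code may differ in its details.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_consistency_loss(sentences, clips):
    """Mean squared gap between each sentence index and its cycled-back index."""
    loss = 0.0
    for i, s in enumerate(sentences):
        # Soft nearest neighbor of sentence i among the clip embeddings.
        alpha = softmax([-sq_dist(s, c) for c in clips])
        soft_clip = [
            sum(a * c[d] for a, c in zip(alpha, clips)) for d in range(len(s))
        ]
        # Cycle back: soft index of that neighbor among the sentences.
        beta = softmax([-sq_dist(soft_clip, t) for t in sentences])
        mu = sum(b * k for k, b in enumerate(beta))
        loss += (i - mu) ** 2
    return loss / len(sentences)
```

When sentence and clip embeddings are well separated and aligned, each sentence cycles back to itself and the loss is close to zero; when embeddings collapse to the same point, the cycled-back index is ambiguous and the loss grows.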
