A. Sophia Koepke

Dissecting Temporal Understanding in Text-to-Audio Retrieval

Sep 01, 2024

Andreea-Maria Oncescu, João F. Henriques, A. Sophia Koepke

Abstract: Recent advancements in machine learning have fueled research on multimodal tasks such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.

* 9 pages, 5 figures, ACM Multimedia 2024, https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/
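
The temporal-ordering question above can be illustrated with a small probe: swap the two events around a temporal connective in a caption, then check whether a retrieval model still prefers the original caption for the matching audio. The sketch below is not the authors' code. The function name `swap_temporal_order`, the connective list, and the example captions are assumptions made purely for illustration.

```python
import re
from typing import Optional

# Hypothetical connectives to probe; the set studied in the paper may differ.
CONNECTIVES = ("followed by", "before", "after", "then")


def swap_temporal_order(caption: str) -> Optional[str]:
    """Swap the two events around the first connective that matches.

    Returns None when no connective is found, so such captions can be skipped.
    Capitalisation is left untouched; only a trailing period is stripped.
    """
    text = caption.strip().rstrip(".")
    for conn in CONNECTIVES:
        # The connective must be surrounded by whitespace, so "afterwards"
        # does not count as "after".
        match = re.fullmatch(rf"(.+?)\s+{re.escape(conn)}\s+(.+)", text)
        if match:
            first, second = match.group(1), match.group(2)
            return f"{second} {conn} {first}"
    return None


if __name__ == "__main__":
    for caption in ("a dog barks followed by a car horn", "rain falls"):
        print(caption, "->", swap_temporal_order(caption))
```

A model that is sensitive to event order should assign a lower similarity to the swapped caption than to the original one for the same audio clip; a model that ignores order will score both roughly equally.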

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

Apr 09, 2024

David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

Abstract: Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class label embeddings, which are combined to boost the performance of the system. We propose a simple yet effective model that relies only on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL with our new features. Code and data are available at https://github.com/dkurzend/ClipClap-GZSL.

* CVPRw 2024 (L3D-IVU)

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Feb 29, 2024

Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they are commonly not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.

* 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

Zero-shot audio captioning with audio-language model guidance and audio context keywords

Nov 14, 2023

Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata

Abstract: Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Unlike speech recognition, which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, which is guided by a pre-trained audio-language model to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Our code is available at https://github.com/ExplainableML/ZerAuCap.

* NeurIPS 2023 - Machine Learning for Audio Workshop (Oral)

Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

Nov 08, 2023

Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

Abstract: Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens which describe the regions in the input image that the VQA model attended to. Crucially, we determine this similarity by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps), or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available at https://github.com/ExplainableML/ZS-A2T.

* Published in GCPR 2023

Video-adverb retrieval with compositional adverb-action embeddings

Sep 26, 2023

Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

Abstract: Retrieving adverbs that describe an action in a video is a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/ReGaDa/.

* BMVC 2023 (Oral)
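
For readers unfamiliar with triplet losses, the following is a minimal sketch of the standard margin-based form applied to embeddings in a joint space. It is a generic textbook formulation, not the training objective of the paper, which also includes a regression target and a residual gating mechanism that are not shown here. The function names, the cosine distance, the margin value, and the toy vectors are all assumptions.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    video = [1.0, 0.0]           # toy video embedding (anchor)
    matching_text = [1.0, 0.0]   # toy embedding of the matching adverb-action text
    other_text = [0.0, 1.0]      # toy embedding of a non-matching text
    print(triplet_loss(video, matching_text, other_text))
    print(triplet_loss(video, other_text, matching_text))
```

The loss is zero when the matching text is already closer to the video than the non-matching text by at least the margin, and grows as the ordering is violated.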

Text-to-feature diffusion for audio-visual few-shot learning

Sep 07, 2023

Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Abstract: Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.

* DAGM GCPR 2023

Image-free Classifier Injection for Zero-Shot Classification

Aug 21, 2023

Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata

Abstract: Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods; therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification capabilities without the use of image data. We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data. Instead, the existing classifier weights and simple class-wise descriptors, such as class names or attributes, are used. ICIS has two encoder-decoder networks that learn to reconstruct classifier weights from descriptors (and vice versa), exploiting (cross-)reconstruction and cosine losses to regularise the decoding process. Notably, ICIS can be cheaply trained and applied directly on top of pre-trained classification models. Experiments on benchmark ZSL datasets show that ICIS produces unseen classifier weights that achieve strong (generalised) zero-shot classification performance. Code is available at https://github.com/ExplainableML/ImageFreeZSL.

* Accepted at ICCV 2023

Addressing caveats of neural persistence with deep graph persistence

Jul 20, 2023

Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke

Abstract: Neural persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. Whilst this captures useful information for linear classifiers, we find that no relevant spatial structure is present in later layers of deep neural networks, making neural persistence roughly equivalent to the variance of weights. Additionally, the proposed averaging procedure across layers for deep neural networks does not consider interaction between layers. Based on our analysis, we propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers, which is equivalent to calculating neural persistence on one particular matrix. This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues through standardisation. Code is available at https://github.com/ExplainableML/Deep-Graph-Persistence.

Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

Jun 12, 2023

Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

Abstract: The visual classification performance of vision-language models such as CLIP can benefit from additional semantic knowledge, e.g. via large language models (LLMs) such as GPT-3. Further extending class names with LLM-generated class descriptors, e.g. "waffle, which has a round shape", or averaging retrieval scores over multiple such descriptors, has been shown to improve generalization performance. In this work, we study this behavior in detail and propose WaffleCLIP, a framework for zero-shot visual classification which achieves similar performance gains on a large number of visual classification tasks by simply replacing LLM-generated descriptors with random character and word descriptors without querying external models. We extend these results with an extensive experimental study on the impact and shortcomings of additional semantics introduced via LLM-generated descriptors, and showcase how semantic context is better leveraged by automatically querying LLMs for high-level concepts, while jointly resolving potential class name ambiguities. Link to the codebase: https://github.com/ExplainableML/WaffleCLIP.
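
The idea of replacing LLM-generated descriptors with random ones, and averaging scores over several such prompts, can be sketched in a few lines. This is an illustration under stated assumptions, not the WaffleCLIP codebase: the prompt template, the descriptor lengths, and the helper names `random_char_descriptor`, `build_prompts`, and `average_score` are hypothetical, and the similarity scores would in practice come from a vision-language model that is not included here.

```python
import random
import string
from typing import List, Sequence


def random_char_descriptor(rng: random.Random, num_words: int = 2, word_len: int = 5) -> str:
    """Build a descriptor made of random lowercase character sequences."""
    words = (
        "".join(rng.choice(string.ascii_lowercase) for _ in range(word_len))
        for _ in range(num_words)
    )
    return " ".join(words)


def build_prompts(class_name: str, num_descriptors: int = 4, seed: int = 0) -> List[str]:
    """Extend a class name with several random descriptors (assumed template)."""
    rng = random.Random(seed)  # seeded so the same prompts are produced on every run
    return [
        f"A photo of a {class_name}, which has {random_char_descriptor(rng)}."
        for _ in range(num_descriptors)
    ]


def average_score(scores: Sequence[float]) -> float:
    """Average per-prompt image-text similarity scores into one class score."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for prompt in build_prompts("waffle"):
        print(prompt)
```

Each prompt would be embedded by the text encoder and compared with the image embedding; the class score is then the mean over the prompts for that class.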
