Denis Paperno

CIMeC - Center for Mind/Brain Sciences, University of Trento

Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Jun 12, 2025

Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.

* 27 pages, 14 figures. Accepted to ACL 2025

Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Aug 12, 2024

Yingjin Song, Denis Paperno, Albert Gatt

Abstract: Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.

* 18 pages, 12 figures, accepted by INLG 2024

Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding

Oct 18, 2023

Timothee Mickus, Elaine Zosa, Denis Paperno

Abstract: Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying what the effects are - if any - of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.

* accepted to Findings of EMNLP 2023

The Scenario Refiner: Grounding subjects in images at the morphological level

Sep 20, 2023

Claudia Tagliaferri, Sofia Axioti, Albert Gatt, Denis Paperno

Abstract: Derivationally related words, such as "runner" and "running", exhibit semantic differences which also elicit different visual scenarios. In this paper, we ask whether Vision and Language (V&L) models capture such distinctions at the morphological level, using a new methodology and dataset. We compare the results from V&L models to human judgements and find that models' predictions differ from those of human participants, in particular displaying a grammatical bias. We further investigate whether the human-model misalignment is related to model architecture. Our methodology, developed on one specific morphological contrast, can be further extended for testing models on capturing other nuanced language features.

* presented at the LIMO workshop (Linguistic Insights from and for Multimodal Language Processing @KONVENS 2023)

Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations

Jun 04, 2023

Aleksey Tikhonov, Lisa Bylinina, Denis Paperno

Abstract: Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models. While different embeddings exhibit different applicability and performance on downstream tasks, little is known about the systematic representation differences attributed to the visual modality. Our paper compares word embeddings from three vision-and-language models (CLIP, OpenCLIP and Multilingual CLIP) and three text-only models, with static (FastText) as well as contextual representations (multilingual BERT; XLM-RoBERTa). This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters. We identify meaning properties and relations that characterize words whose embeddings are most affected by the inclusion of visual modality in the training data; that is, points where visual grounding turns out most important. We find that the effect of visual modality correlates most with denotational semantic properties related to concreteness, but is also detected for several specific semantic classes, as well as for valence, a sentiment-related connotational property of linguistic expressions.

* Accepted for StarSEM 2023

Towards leveraging latent knowledge and Dialogue context for real-world conversational question answering

Dec 17, 2022

Shaomu Tan, Denis Paperno

Abstract: In many real-world scenarios, the absence of an external knowledge source like Wikipedia restricts question answering systems to rely on latent internal knowledge in limited dialogue data. In addition, humans often seek answers by asking several questions for more comprehensive information. As the dialogue becomes more extensive, machines are challenged to refer to previous conversation rounds to answer questions. In this work, we propose to leverage latent knowledge in existing conversation logs via a neural Retrieval-Reading system, enhanced with a TFIDF-based text summarizer refining lengthy conversational history to alleviate the long context issue. Our experiments show that our Retrieval-Reading system can exploit retrieved background knowledge to generate significantly better answers. The results also indicate that our context summarizer significantly helps both the retriever and the reader by introducing more concise and less noisy contextual information.
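
The abstract above does not say how the TFIDF-based summarizer is built, so the following is only a minimal sketch of the general idea: score each past dialogue turn by the mean TF-IDF weight of its tokens and keep the highest-scoring turns as a shortened history. The function names (`tokenize`, `summarize_history`), the smoothed IDF formula, and the `keep` parameter are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize_history(turns, keep=2):
    """Return the `keep` turns with the highest mean TF-IDF weight, in original order.

    Hypothetical helper: each turn is treated as one document, and IDF is
    computed over the turns of this single conversation only.
    """
    docs = [tokenize(turn) for turn in turns]
    n_docs = len(docs)
    # Document frequency: in how many turns does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))

    def score(doc):
        if not doc:
            return 0.0
        term_freq = Counter(doc)
        # Sum of (relative term frequency * smoothed IDF) over the turn's vocabulary.
        return sum(
            (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in term_freq.items()
        )

    # Rank turn indices by descending score; break ties by position.
    ranked = sorted(range(n_docs), key=lambda i: (-score(docs[i]), i))
    kept = sorted(ranked[:keep])  # restore chronological order
    return [turns[i] for i in kept]


if __name__ == "__main__":
    history = [
        "hello",
        "hello",
        "the printer driver crashes on startup",
        "hello there",
    ]
    print(summarize_history(history, keep=2))
```

Because the score is a mean over tokens, turns made of words that recur across the conversation (greetings, acknowledgements) rank low, while turns with rarer content words rank high. A real system would likely need a better tokenizer and some handling of very short turns.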

Generating image captions with external encyclopedic knowledge

Oct 10, 2022

Sofia Nikiforova, Tejaswini Deoskar, Denis Paperno, Yoad Winter

Abstract: Accurately reporting what objects are depicted in an image is largely a solved problem in automatic caption generation. The next big challenge on the way to truly humanlike captioning is being able to incorporate the context of the image and related real world knowledge. We tackle this challenge by creating an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data. Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base, with their subsequent integration into the captioning pipeline at both the encoding and decoding stages. Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions, and achieves significant improvements over multiple baselines. We empirically demonstrate that our approach is effective for generating contextualized captions with encyclopedic knowledge that is both factually accurate and relevant to the image.

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces

Jun 07, 2022

Timothee Mickus, Denis Paperno, Mathieu Constant

Abstract: Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.

* Accepted at TACL (pre-MIT Press publication version)

Semeval-2022 Task 1: CODWOE -- Comparing Dictionaries and Word Embeddings

May 27, 2022

Timothee Mickus, Kees van Deemter, Mathieu Constant, Denis Paperno

Abstract: Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.

A Game Interface to Study Semantic Grounding in Text-Based Models

Aug 17, 2021

Timothee Mickus, Mathieu Constant, Denis Paperno

Abstract: Can language models learn grounded representations from text distribution alone? This question is both central and recurrent in natural language processing; authors generally agree that grounding requires more than textual distribution. We propose to experimentally test this claim: if any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of the reach of text-based models. To that end, we present early work on an online game for the collection of human judgments on the distributional similarity of word pairs in five languages. We further report early results of our data collection campaign.
