Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Alessandro Conti

On Large Multimodal Models as Open-World Image Classifiers

Mar 27, 2025

Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci

Abstract:Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.

* 23 pages, 13 figures, code is available at https://github.com/altndrr/lmms-owc

Via

Access Paper or Ask Questions

Compositional Caching for Training-free Open-vocabulary Attribute Detection

Mar 24, 2025

Marco Garosi, Alessandro Conti, Gaowen Liu, Elisa Ricci, Massimiliano Mancini

Abstract:Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.

* CVPR 2025. Project website at https://comca-attributes.github.io/

Via

Access Paper or Ask Questions

Automatic benchmarking of large multimodal models via iterative experiment programming

Jun 18, 2024

Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci

Figure 1 for Automatic benchmarking of large multimodal models via iterative experiment programming

Figure 2 for Automatic benchmarking of large multimodal models via iterative experiment programming

Figure 3 for Automatic benchmarking of large multimodal models via iterative experiment programming

Figure 4 for Automatic benchmarking of large multimodal models via iterative experiment programming

Abstract:Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.

* 31 pages, 6 figures, code is available at https://github.com/altndrr/apex

Via

Access Paper or Ask Questions

Vocabulary-free Image Classification and Semantic Segmentation

Apr 16, 2024

Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

Figure 1 for Vocabulary-free Image Classification and Semantic Segmentation

Figure 2 for Vocabulary-free Image Classification and Semantic Segmentation

Figure 3 for Vocabulary-free Image Classification and Semantic Segmentation

Figure 4 for Vocabulary-free Image Classification and Semantic Segmentation

Abstract:Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.

* Under review, 22 pages, 10 figures, code is available at https://github.com/altndrr/vicss. arXiv admin note: text overlap with arXiv:2306.00917

Via

Access Paper or Ask Questions

Test-Time Zero-Shot Temporal Action Localization

Apr 11, 2024

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract:Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.

* CVPR 2024

Via

Access Paper or Ask Questions

Socially Pertinent Robots in Gerontological Healthcare

Apr 11, 2024

Xavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García, Chris Reinke, Soraya Arias, Federica Arrigoni, Alex Auternaud, Lauriane Blavette, Cigdem Beyan, Luis Gomez Camara(+34 more)

Figure 1 for Socially Pertinent Robots in Gerontological Healthcare

Figure 2 for Socially Pertinent Robots in Gerontological Healthcare

Figure 3 for Socially Pertinent Robots in Gerontological Healthcare

Figure 4 for Socially Pertinent Robots in Gerontological Healthcare

Abstract:Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.

Via

Access Paper or Ask Questions

The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

Aug 22, 2023

Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci

Abstract:Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.

* Accepted at ICCV2023, 14 pages, 7 figures, code is available at https://github.com/giaczara/dallv

Via

Access Paper or Ask Questions

Vocabulary-free Image Classification

Jun 01, 2023

Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

Figure 1 for Vocabulary-free Image Classification

Figure 2 for Vocabulary-free Image Classification

Figure 3 for Vocabulary-free Image Classification

Figure 4 for Vocabulary-free Image Classification

Abstract:Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.

* 17 pages, 8 figures, code is available at https://github.com/altndrr/vic

Via

Access Paper or Ask Questions

Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Oct 11, 2022

Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Figure 1 for Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Figure 2 for Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Figure 3 for Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Figure 4 for Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Abstract:Automatically understanding emotions from visual data is a fundamental task for human behaviour understanding. While models devised for Facial Expression Recognition (FER) have demonstrated excellent performances on many datasets, they often suffer from severe performance degradation when trained and tested on different datasets due to domain shift. In addition, as face images are considered highly sensitive data, the accessibility to large-scale datasets for model training is often denied. In this work, we tackle the above-mentioned problems by proposing the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for FER. Our method exploits self-supervised pretraining to learn good feature representations from the target data and proposes a novel and robust cluster-level pseudo-labelling strategy that accounts for in-cluster statistics. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER, and is on par with methods addressing FER in the UDA setting.

* Accepted at BMVC2022, 13 pages, 4 figures, code is available at https://github.com/altndrr/clup

Via

Access Paper or Ask Questions

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Jul 23, 2022

Riccardo Franceschini, Enrico Fini, Cigdem Beyan, Alessandro Conti, Federica Arrigoni, Elisa Ricci

Figure 1 for Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Figure 2 for Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Figure 3 for Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Figure 4 for Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Abstract:Emotion recognition is involved in several real-world applications. With an increase in available modalities, automatic understanding of emotions is being performed more accurately. The success in Multimodal Emotion Recognition (MER), primarily relies on the supervised learning paradigm. However, data annotation is expensive, time-consuming, and as emotion expression and perception depends on several factors (e.g., age, gender, culture) obtaining labels with a high reliability is hard. Motivated by these, we focus on unsupervised feature learning for MER. We consider discrete emotions, and as modalities text, audio and vision are used. Our method, as being based on contrastive loss between pairwise modalities, is the first attempt in MER literature. Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods: i) it is unsupervised, so the learning is lack of data labelling cost; ii) it does not require data spatial augmentation, modality alignment, large number of batch size or epochs; iii) it applies data fusion only at inference; and iv) it does not require backbones pre-trained on emotion recognition task. The experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER. Particularly, it even surpasses a few supervised MER state-of-the-art.

* Accepted to 26th International Conference on Pattern Recognition (ICPR) 2022

Via

Access Paper or Ask Questions