Angelo Cangelosi

Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects

Jun 24, 2025

Federico Tavella, Kathryn Mearns, Angelo Cangelosi

Abstract:Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, such as BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.

Joint Action Language Modelling for Transparent Policy Execution

Apr 14, 2025

Theodor Wulff, Rahul Singh Maharjan, Xinyun Chi, Angelo Cangelosi

Abstract:An agent's intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency towards the agent's behavior. We aim to insert transparent behavior directly into the learning process, by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model is able to learn to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases both the quality of the action trajectory and the transparent statement increase when they are generated simultaneously.
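The idea of representing a policy with "special discretized tokens" can be illustrated with a minimal sketch. It assumes uniform binning of a scalar action into a fixed number of bins; the range, the bin count, the `<act_N>` token format, and the function names are hypothetical choices for illustration, not details taken from the paper.

```python
def action_to_token(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action value to a discrete bin index (assumed uniform binning)."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper bound falls into the last bin


def token_to_action(idx, low=-1.0, high=1.0, bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / bins
    return low + (idx + 0.5) * width


def build_sequence(statement, action):
    """Concatenate a language statement with action tokens (hypothetical format)."""
    tokens = " ".join(f"<act_{action_to_token(a)}>" for a in action)
    return f"{statement} {tokens}"


example = build_sequence("push the red block left", [-1.0, 0.3])
```

With this layout, a single autoregressive model could emit the statement first and the action tokens afterwards; decoding a token back to its bin centre loses at most half a bin width of precision.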

Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning

Apr 14, 2025

Tien Pham, Angelo Cangelosi

Abstract:Current approaches in Explainable Deep Reinforcement Learning are limited in that the attention mask is displaced from the objects in the visual input. This work addresses a spatial problem within traditional Convolutional Neural Networks (CNNs). We propose the Interpretable Feature Extractor (IFE) architecture, aimed at generating an accurate attention mask to illustrate both "what" and "where" the agent concentrates on in the spatial domain. Our design incorporates a Human-Understandable Encoding module to generate a fully interpretable attention mask, followed by an Agent-Friendly Encoding module to enhance the agent's learning efficiency. These two components together form the Interpretable Feature Extractor for vision-based deep reinforcement learning to enable the model's interpretability. The resulting attention mask is consistent, highly understandable by humans, accurate in spatial dimension, and effectively highlights important objects or locations in visual input. The Interpretable Feature Extractor is integrated into the Fast and Data-efficient Rainbow framework, and evaluated on 57 ATARI games to show the effectiveness of the proposed approach on Spatial Preservation, Interpretability, and Data-efficiency. Finally, we showcase the versatility of our approach by incorporating the IFE into the Asynchronous Advantage Actor-Critic Model.

Attributes-aware Visual Emotion Representation Learning

Apr 09, 2025

Rahul Singh Maharjan, Marta Romeo, Angelo Cangelosi

Abstract:Visual emotion analysis or recognition has gained considerable attention due to the growing interest in understanding how images can convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially due to the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have used deep representation learning methods to address this challenge of extracting generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. Through this paper, we introduce A4Net, a deep representation network to bridge the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide a better insight into emotional content in images. Experimental results show the effectiveness of A4Net, showcasing competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.

* 9 pages, 3 figures

Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems

Mar 24, 2025

Jacopo de Berardinis, Lorenzo Porcaro, Albert Meroño-Peñuela, Angelo Cangelosi, Tess Buckley

Abstract:Generative AI is radically changing the creative arts, by fundamentally transforming the way we create and interact with cultural artefacts. While offering unprecedented opportunities for artistic expression and commercialisation, this technology also raises ethical, societal, and legal concerns. Key among these are the potential displacement of human creativity, copyright infringement stemming from vast training datasets, and the lack of transparency, explainability, and fairness mechanisms. As generative systems become pervasive in this domain, responsible design is crucial. Whilst previous work has tackled isolated aspects of generative systems (e.g., transparency, evaluation, data), we take a comprehensive approach, grounding these efforts within the Ethics Guidelines for Trustworthy Artificial Intelligence produced by the High-Level Expert Group on AI appointed by the European Commission - a framework for designing responsible AI systems across seven macro requirements. Focusing on generative music AI, we illustrate how these requirements can be contextualised for the field, addressing trustworthiness across multiple dimensions and integrating insights from the existing literature. We further propose a roadmap for operationalising these contextualised requirements, emphasising interdisciplinary collaboration and stakeholder engagement. Our work provides a foundation for designing and evaluating responsible music generation systems, calling for collaboration among AI experts, ethicists, legal scholars, and artists. This manuscript is accompanied by a website: https://amresearchlab.github.io/raim-framework/.

The ATTUNE model for Artificial Trust Towards Human Operators

Nov 29, 2024

Giannis Petousakis, Angelo Cangelosi, Rustam Stolkin, Manolis Chiou

Abstract:This paper presents a novel method to quantify Trust in HRI. It proposes an HRI framework for estimating the Robot Trust towards the Human in the context of a narrow and specified task. The framework produces a real-time estimation of an AI agent's Artificial Trust towards a Human partner interacting with a mobile teleoperation robot. The approach for the framework is based on principles drawn from Theory of Mind, including information about the human state, action, and intent. The framework creates the ATTUNE model for Artificial Trust Towards Human Operators. The model uses metrics on the operator's state of attention, navigational intent, actions, and performance to quantify the Trust towards them. The model is tested on a pre-existing dataset that includes recordings (ROSbags) of a human trial in a simulated disaster response scenario. The performance of ATTUNE is evaluated through a qualitative and quantitative analysis. The results of the analyses provide insight into the next stages of the research and help refine the proposed approach.

* Published in IEEE SMC 2024

From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning

Oct 03, 2024

Haodong Xie, Rahul Singh Maharjan, Federico Tavella, Angelo Cangelosi

Abstract:Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence. Yet, they remain challenging for artificial agents. This paper introduces a multimodal generative approach to high-order abstract concept learning, which integrates visual and categorical linguistic information from concrete ones. Our model initially grounds subordinate-level concrete concepts, combines them to form basic-level concepts, and finally abstracts to superordinate-level concepts via the grounding of basic-level concepts. We evaluate the model's language learning ability through language-to-visual and visual-to-language tests with high-order abstract concepts. Experimental results demonstrate the proficiency of the model in both language understanding and language naming tasks.

Noise-Free Explanation for Driving Action Prediction

Jul 08, 2024

Hongbo Zhu, Theodor Wulff, Rahul Singh Maharjan, Jinpei Han, Angelo Cangelosi

Abstract:Although attention mechanisms have achieved considerable progress in Transformer-based architectures across various Artificial Intelligence (AI) domains, their inner workings remain to be explored. Existing explainable methods have different emphases but are rather one-sided. They primarily analyse the attention mechanisms or gradient-based attribution while neglecting the magnitudes of input feature values or the skip-connection module. Moreover, they inevitably bring spurious noisy pixel attributions unrelated to the model's decision, hindering humans' trust in the spotted visualization result. Hence, we propose an easy-to-implement but effective way to remedy this flaw: Smooth Noise Norm Attention (SNNA). We weigh the attention by the norm of the transformed value vector and guide the label-specific signal with the attention gradient, then randomly sample the input perturbations and average the corresponding gradients to produce noise-free attribution. Instead of evaluating the explanation method on binary or multi-class classification tasks as in previous works, we explore the more complex multi-label classification scenario in this work, i.e., the driving action prediction task, and train a model specifically for it. Both qualitative and quantitative evaluation results show the superiority of SNNA compared to other SOTA attention-based explainable methods in generating a clearer visual explanation map and ranking the input pixel importance.
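Two of the ingredients named in this abstract, weighting attention by the norm of the value vectors and averaging gradients over randomly perturbed inputs, can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the authors' SNNA implementation: it uses plain Python lists, Gaussian noise with a hypothetical `sigma`, a caller-supplied `grad_fn`, and it omits the attention-gradient guidance step entirely.

```python
import math
import random


def norm_weighted_attention(attn, values):
    """Scale each attention weight a[i][j] by the L2 norm of value vector j."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    return [[a * n for a, n in zip(row, norms)] for row in attn]


def smoothed_gradient(grad_fn, x, sigma=0.1, samples=50, seed=0):
    """Average grad_fn over Gaussian-perturbed copies of x (assumed noise model)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / samples for t in total]


# A token with a zero value vector contributes nothing, whatever its attention weight.
weighted = norm_weighted_attention([[0.5, 0.5]], [[3.0, 4.0], [0.0, 0.0]])
# For f(z) = sum(z_i^2) the gradient is 2z; averaging over noise stays close to 2x.
avg_grad = smoothed_gradient(lambda z: [2.0 * zi for zi in z], [1.0, -2.0])
```

The norm weighting shows why raw attention alone can mislead: a highly attended token whose transformed value vector is near zero has little effect on the output. The averaging step reduces the influence of any single noisy gradient sample.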

* 10 pages, 10 figures

Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

Jun 14, 2024

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

Abstract:Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching a simulated humanoid American Sign Language. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions. Compared to other methods, our approach eliminates the need for additional hardware to acquire information. We demonstrate how the combination of these different techniques offers a viable way to learn sign language. Our methodology successfully teaches 5 different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.

ToP-ToM: Trust-aware Robot Policy with Theory of Mind

Nov 07, 2023

Chuang Yu, Baris Serhan, Angelo Cangelosi

Abstract:Theory of Mind (ToM) is a fundamental cognitive architecture that endows humans with the ability to attribute mental states to others. Humans infer the desires, beliefs, and intentions of others by observing their behavior and, in turn, adjust their actions to facilitate better interpersonal communication and team collaboration. In this paper, we investigated trust-aware robot policy with the theory of mind in a multiagent setting where a human collaborates with a robot against another human opponent. We show that by only focusing on team performance, the robot may resort to the reverse psychology trick, which poses a significant threat to trust maintenance. The human's trust in the robot will collapse when they discover deceptive behavior by the robot. To mitigate this problem, we adopt the robot theory of mind model to infer the human's trust beliefs, including true belief and false belief (an essential element of ToM). We designed a dynamic trust-aware reward function based on different trust beliefs to guide the robot policy learning, which aims to balance between avoiding human trust collapse due to robot reverse psychology. The experimental results demonstrate the importance of the ToM-based robot policy for human-robot trust and the effectiveness of our robot ToM-based robot policy in multiagent interaction settings.

* 7 pages
