Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Agata Lapedriza

Universitat Oberta de Catalunya

Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

Jan 28, 2025

Josep Lopez Camunas, Cristina Bustos, Yanjun Zhu, Raquel Ros, Agata Lapedriza

Figure 1 for Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

Figure 2 for Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

Figure 3 for Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

Figure 4 for Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

Abstract:Understanding emotional signals in older adults is crucial for designing virtual assistants that support their well-being. However, existing affective computing models often face significant limitations: (1) limited availability of datasets representing older adults, especially in non-English-speaking populations, and (2) poor generalization of models trained on younger or homogeneous demographics. To address these gaps, this study evaluates state-of-the-art affective computing models -- including facial expression recognition, text sentiment analysis, and smile detection -- using videos of older adults interacting with either a person or a virtual avatar. As part of this effort, we introduce a novel dataset featuring Spanish-speaking older adults engaged in human-to-human video interviews. Through three comprehensive analyses, we investigate (1) the alignment between human-annotated labels and automatic model outputs, (2) the relationships between model outputs across different modalities, and (3) individual variations in emotional signals. Using both the Wizard of Oz (WoZ) dataset and our newly collected dataset, we uncover limited agreement between human annotations and model predictions, weak consistency across modalities, and significant variability among individuals. These findings highlight the shortcomings of generalized emotion perception models and emphasize the need of incorporating personal variability and cultural nuances into future systems.

Via

Access Paper or Ask Questions

Analyzing Cultural Representations of Emotions in LLMs through Mixed Emotion Survey

Aug 04, 2024

Shiran Dudy, Ibrahim Said Ahmad, Ryoko Kitajima, Agata Lapedriza

Abstract:Large Language Models (LLMs) have gained widespread global adoption, showcasing advanced linguistic capabilities across multiple of languages. There is a growing interest in academia to use these models to simulate and study human behaviors. However, it is crucial to acknowledge that an LLM's proficiency in a specific language might not fully encapsulate the norms and values associated with its culture. Concerns have emerged regarding potential biases towards Anglo-centric cultures and values due to the predominance of Western and US-based training data. This study focuses on analyzing the cultural representations of emotions in LLMs, in the specific case of mixed-emotion situations. Our methodology is based on the studies of Miyamoto et al. (2010), which identified distinctive emotional indicators in Japanese and American human responses. We first administer their mixed emotion survey to five different LLMs and analyze their outputs. Second, we experiment with contextual variables to explore variations in responses considering both language and speaker origin. Thirdly, we expand our investigation to encompass additional East Asian and Western European origin languages to gauge their alignment with their respective cultures, anticipating a closer fit. We find that (1) models have limited alignment with the evidence in the literature; (2) written language has greater effect on LLMs' response than information on participants origin; and (3) LLMs responses were found more similar for East Asian languages than Western European languages.

* Was accepted to ACII 2024

Via

Access Paper or Ask Questions

Analyzing the contribution of different passively collected data to predict Stress and Depression

Oct 20, 2023

Irene Bonafonte, Cristina Bustos, Abraham Larrazolo, Gilberto Lorenzo Martinez Luna, Adolfo Guzman Arenas, Xavier Baro, Isaac Tourgeman, Mercedes Balcells, Agata Lapedriza

Figure 1 for Analyzing the contribution of different passively collected data to predict Stress and Depression

Abstract:The possibility of recognizing diverse aspects of human behavior and environmental context from passively captured data motivates its use for mental health assessment. In this paper, we analyze the contribution of different passively collected sensor data types (WiFi, GPS, Social interaction, Phone Log, Physical Activity, Audio, and Academic features) to predict daily selfreport stress and PHQ-9 depression score. First, we compute 125 mid-level features from the original raw data. These 125 features include groups of features from the different sensor data types. Then, we evaluate the contribution of each feature type by comparing the performance of Neural Network models trained with all features against Neural Network models trained with specific feature groups. Our results show that WiFi features (which encode mobility patterns) and Phone Log features (which encode information correlated with sleep patterns), provide significative information for stress and depression prediction.

Via

Access Paper or Ask Questions

On the use of Vision-Language models for Visual Sentiment Analysis: a study on CLIP

Oct 18, 2023

Cristina Bustos, Carles Civit, Brian Du, Albert Sole-Ribalta, Agata Lapedriza

Abstract:This work presents a study on how to exploit the CLIP embedding space to perform Visual Sentiment Analysis. We experiment with two architectures built on top of the CLIP embedding space, which we denote by CLIP-E. We train the CLIP-E models with WEBEmo, the largest publicly available and manually labeled benchmark for Visual Sentiment Analysis, and perform two sets of experiments. First, we test on WEBEmo and compare the CLIP-E architectures with state-of-the-art (SOTA) models and with CLIP Zero-Shot. Second, we perform cross dataset evaluation, and test the CLIP-E architectures trained with WEBEmo on other Visual Sentiment Analysis benchmarks. Our results show that the CLIP-E approaches outperform SOTA models in WEBEmo fine grained categorization, and they also generalize better when tested on datasets that have not been seen during training. Interestingly, we observed that for the FI dataset, CLIP Zero-Shot produces better accuracies than SOTA models and CLIP-E trained on WEBEmo. These results motivate several questions that we discuss in this paper, such as how we should design new benchmarks and evaluate Visual Sentiment Analysis, and whether we should keep designing tailored Deep Learning models for Visual Sentiment Analysis or focus our efforts on better using the knowledge encoded in large vision-language models such as CLIP for this task.

Via

Access Paper or Ask Questions

Local Relighting of Real Scenes

Jul 06, 2022

Audrey Cui, Ali Jahanian, Agata Lapedriza, Antonio Torralba, Shahin Mahdizadehaghdam, Rohit Kumar, David Bau

Figure 1 for Local Relighting of Real Scenes

Figure 2 for Local Relighting of Real Scenes

Figure 3 for Local Relighting of Real Scenes

Figure 4 for Local Relighting of Real Scenes

Abstract:We introduce the task of local relighting, which changes a photograph of a scene by switching on and off the light sources that are visible within the image. This new task differs from the traditional image relighting problem, as it introduces the challenge of detecting light sources and inferring the pattern of light that emanates from them. We propose an approach for local relighting that trains a model without supervision of any novel image dataset by using synthetically generated image pairs from another model. Concretely, we collect paired training images from a stylespace-manipulated GAN; then we use these images to train a conditional image-to-image model. To benchmark local relighting, we introduce Lonoff, a collection of 306 precisely aligned images taken in indoor spaces with different combinations of lights switched on. We show that our method significantly outperforms baseline methods based on GAN inversion. Finally, we demonstrate extensions of our method that control different light sources separately. We invite the community to tackle this new task of local relighting.

* 15 pages, 15 figures

Via

Access Paper or Ask Questions

Predicting the impact of urban change in pedestrian and road safety

Feb 03, 2022

Cristina Bustos, Daniel Rhoads, Agata Lapedriza, Javier Borge-Holthoefer, Albert Solé-Ribalta

Figure 1 for Predicting the impact of urban change in pedestrian and road safety

Figure 2 for Predicting the impact of urban change in pedestrian and road safety

Figure 3 for Predicting the impact of urban change in pedestrian and road safety

Figure 4 for Predicting the impact of urban change in pedestrian and road safety

Abstract:Increased interaction between and among pedestrians and vehicles in the crowded urban environments of today gives rise to a negative side-effect: a growth in traffic accidents, with pedestrians being the most vulnerable elements. Recent work has shown that Convolutional Neural Networks are able to accurately predict accident rates exploiting Street View imagery along urban roads. The promising results point to the plausibility of aided design of safe urban landscapes, for both pedestrians and vehicles. In this paper, by considering historical accident data and Street View images, we detail how to automatically predict the impact (increase or decrease) of urban interventions on accident incidence. The results are positive, rendering an accuracies ranging from 60 to 80%. We additionally provide an interpretability analysis to unveil which specific categories of urban features impact accident rates positively or negatively. Considering the transportation network substrates (sidewalk and road networks) and their demand, we integrate these results to a complex network framework, to estimate the effective impact of urban change on the safety of pedestrians and vehicles. Results show that public authorities may leverage on machine learning tools to prioritize targeted interventions, since our analysis show that limited improvement is obtained with current tools. Further, our findings have a wider application range such as the design of safe urban routes for pedestrians or to the field of driver-assistance technologies.

Via

Access Paper or Ask Questions

Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents

Jan 11, 2022

Ethan Weber, Dim P. Papadopoulos, Agata Lapedriza, Ferda Ofli, Muhammad Imran, Antonio Torralba

Figure 1 for Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents

Figure 2 for Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents

Figure 3 for Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents

Figure 4 for Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents

Abstract:Natural disasters, such as floods, tornadoes, or wildfires, are increasingly pervasive as the Earth undergoes global warming. It is difficult to predict when and where an incident will occur, so timely emergency response is critical to saving the lives of those endangered by destructive events. Fortunately, technology can play a role in these situations. Social media posts can be used as a low-latency data source to understand the progression and aftermath of a disaster, yet parsing this data is tedious without automated methods. Prior work has mostly focused on text-based filtering, yet image and video-based filtering remains largely unexplored. In this work, we present the Incidents1M Dataset, a large-scale multi-label dataset which contains 977,088 images, with 43 incident and 49 place categories. We provide details of the dataset construction, statistics and potential biases; introduce and train a model for incident detection; and perform image-filtering experiments on millions of images on Flickr and Twitter. We also present some applications on incident analysis to encourage and enable future work in computer vision for humanitarian aid. Code, data, and models are available at http://incidentsdataset.csail.mit.edu.

Via

Access Paper or Ask Questions

Explainable, automated urban interventions to improve pedestrian and vehicle safety

Nov 08, 2021

Cristina Bustos, Daniel Rhoads, Albert Sole-Ribalta, David Masip, Alex Arenas, Agata Lapedriza, Javier Borge-Holthoefer

Figure 1 for Explainable, automated urban interventions to improve pedestrian and vehicle safety

Figure 2 for Explainable, automated urban interventions to improve pedestrian and vehicle safety

Figure 3 for Explainable, automated urban interventions to improve pedestrian and vehicle safety

Abstract:At the moment, urban mobility research and governmental initiatives are mostly focused on motor-related issues, e.g. the problems of congestion and pollution. And yet, we can not disregard the most vulnerable elements in the urban landscape: pedestrians, exposed to higher risks than other road users. Indeed, safe, accessible, and sustainable transport systems in cities are a core target of the UN's 2030 Agenda. Thus, there is an opportunity to apply advanced computational tools to the problem of traffic safety, in regards especially to pedestrians, who have been often overlooked in the past. This paper combines public data sources, large-scale street imagery and computer vision techniques to approach pedestrian and vehicle safety with an automated, relatively simple, and universally-applicable data-processing scheme. The steps involved in this pipeline include the adaptation and training of a Residual Convolutional Neural Network to determine a hazard index for each given urban scene, as well as an interpretability analysis based on image segmentation and class activation mapping on those same images. Combined, the outcome of this computational approach is a fine-grained map of hazard levels across a city, and an heuristic to identify interventions that might simultaneously improve pedestrian and vehicle safety. The proposed framework should be taken as a complement to the work of urban planners and public authorities.

Via

Access Paper or Ask Questions

Predicting Driver Self-Reported Stress by Analyzing the Road Scene

Sep 27, 2021

Cristina Bustos, Neska Elhaouij, Albert Sole-Ribalta, Javier Borge-Holthoefer, Agata Lapedriza, Rosalind Picard

Figure 1 for Predicting Driver Self-Reported Stress by Analyzing the Road Scene

Figure 2 for Predicting Driver Self-Reported Stress by Analyzing the Road Scene

Figure 3 for Predicting Driver Self-Reported Stress by Analyzing the Road Scene

Figure 4 for Predicting Driver Self-Reported Stress by Analyzing the Road Scene

Abstract:Several studies have shown the relevance of biosignals in driver stress recognition. In this work, we examine something important that has been less frequently explored: We develop methods to test if the visual driving scene can be used to estimate a drivers' subjective stress levels. For this purpose, we use the AffectiveROAD video recordings and their corresponding stress labels, a continuous human-driver-provided stress metric. We use the common class discretization for stress, dividing its continuous values into three classes: low, medium, and high. We design and evaluate three computer vision modeling approaches to classify the driver's stress levels: (1) object presence features, where features are computed using automatic scene segmentation; (2) end-to-end image classification; and (3) end-to-end video classification. All three approaches show promising results, suggesting that it is possible to approximate the drivers' subjective stress from the information found in the visual scene. We observe that the video classification, which processes the temporal information integrated with the visual information, obtains the highest accuracy of $0.72$, compared to a random baseline accuracy of $0.33$ when tested on a set of nine drivers.

Via

Access Paper or Ask Questions

Interpreting Face Inference Models using Hierarchical Network Dissection

Aug 23, 2021

Divyang Teotia, Agata Lapedriza, Sarah Ostadabbas

Figure 1 for Interpreting Face Inference Models using Hierarchical Network Dissection

Figure 2 for Interpreting Face Inference Models using Hierarchical Network Dissection

Figure 3 for Interpreting Face Inference Models using Hierarchical Network Dissection

Figure 4 for Interpreting Face Inference Models using Hierarchical Network Dissection

Abstract:This paper presents Hierarchical Network Dissection, a general pipeline to interpret the internal representation of face-centric inference models. Using a probabilistic formulation, Hierarchical Network Dissection pairs units of the model with concepts in our "Face Dictionary" (a collection of facial concepts with corresponding sample images). Our pipeline is inspired by Network Dissection, a popular interpretability model for object-centric and scene-centric models. However, our formulation allows to deal with two important challenges of face-centric models that Network Dissection cannot address: (1) spacial overlap of concepts: there are different facial concepts that simultaneously occur in the same region of the image, like "nose" (facial part) and "pointy nose" (facial attribute); and (2) global concepts: there are units with affinity to concepts that do not refer to specific locations of the face (e.g. apparent age). To validate the effectiveness of our unit-concept pairing formulation, we first conduct controlled experiments on biased data. These experiments illustrate how Hierarchical Network Dissection can be used to discover bias in the training data. Then, we dissect different face-centric inference models trained on widely-used facial datasets. The results show models trained for different tasks have different internal representations. Furthermore, the interpretability results reveal some biases in the training data and some interesting characteristics of the face-centric inference tasks.

Via

Access Paper or Ask Questions