Cristina Luna-Jiménez

Video Joint-Embedding Predictive Architectures for Facial Expression Recognition

Jan 14, 2026

Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André

Abstract: This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. As a result, the trained encoder does not need to capture information that is irrelevant to a given video, such as the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.

* To appear in 2025 Proceedings of the 13th International Conference on Affective Computing and Intelligent Interaction (ACII), submitted to IEEE. © 2025 IEEE
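
The training signal described in the abstract above, predicting the embeddings of masked regions from the embeddings of unmasked regions, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`jepa_style_loss`, `toy_encode`, `toy_predict`), the linear "encoder", the mean-based "predictor", the L1 distance, and the token layout are all assumptions made for illustration, and the same encoder is used for context and targets purely to keep the sketch short.

```python
import numpy as np


def jepa_style_loss(tokens, mask, encode, predict):
    """Mean L1 distance between predicted and target embeddings of masked tokens.

    tokens : (N, D_in) array of patch tokens (hypothetical layout).
    mask   : (N,) boolean array, True where a token is hidden from the context.
    """
    targets = encode(tokens)[mask]                # embeddings the predictor must match
    context = encode(tokens[~mask])               # embeddings of the visible tokens only
    preds = predict(context, int(mask.sum()))     # one predicted embedding per masked token
    return float(np.abs(preds - targets).mean())  # compared in embedding space, not pixels


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                       # toy linear "encoder" weights (assumption)


def toy_encode(x):
    # Stand-in for a video encoder: a single linear projection per token.
    return x @ W


def toy_predict(context, n_masked):
    # Stand-in predictor: guess the mean context embedding for every masked token.
    return np.repeat(context.mean(axis=0, keepdims=True), n_masked, axis=0)


tokens = rng.normal(size=(6, 8))
mask = np.array([False, True, False, True, True, False])
loss = jepa_style_loss(tokens, mask, toy_encode, toy_predict)
print(f"toy embedding-prediction loss: {loss:.4f}")
```

Because the loss is measured between embeddings rather than pixels, nothing in the objective rewards reproducing low-level detail such as background color, which is the property the abstract highlights.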

Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models

Oct 19, 2022

Ricardo Kleinlein, Cristina Luna-Jiménez, Fernando Fernández-Martínez

Abstract: The impressive capacity shown by recent text-to-image diffusion models to generate high-quality pictures from textual input prompts has fueled the debate about the very definition of art. Nonetheless, these models have been trained using text data collected from content-based labelling protocols that focus on describing the items and actions in an image but neglect any subjective appraisal. Consequently, these automatic systems need rigorous descriptions of the elements and the pictorial style of the image to be generated, and otherwise fail to deliver. As potential indicators of the actual artistic capabilities of current generative models, we characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models. Considering the sharp difference observed between their language style and that typically employed in artistic contexts, we suggest that generative models should incorporate additional sources of subjective information in their training in order to overcome (or at least alleviate) some of their current limitations, thus enabling truly artistic and creative generation.

* NeurIPS 2022 Machine Learning for Creativity and Design Workshop
