Abstract: Virtual environments provide a rich and controlled setting for collecting detailed data on human behavior, offering unique opportunities for predicting human trajectories in dynamic scenes. However, most existing approaches have overlooked the potential of these environments, focusing instead on static contexts without considering user-specific factors. Employing the CREATTIVE3D dataset, our work models trajectories recorded in virtual reality (VR) scenes for diverse situations, including road-crossing tasks with user interactions and simulated visual impairments. We propose Diverse Context VR Human Motion Prediction (DiVR), a cross-modal transformer based on the Perceiver architecture that integrates both static and dynamic scene context using a heterogeneous graph convolution network. We conduct extensive experiments comparing DiVR against existing architectures, including MLP, LSTM, and transformers with gaze and point-cloud context. We also stress-test our model's generalizability across different users, tasks, and scenes. Results show that DiVR achieves higher accuracy and adaptability than the other models and than static graphs. This work highlights the advantages of using VR datasets for context-aware human trajectory modeling, with potential applications in enhancing user experiences in the metaverse. Our source code is publicly available at https://gitlab.inria.fr/ffrancog/creattive3d-divr-model.
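The abstract describes DiVR as a Perceiver-style cross-modal transformer that fuses a user's motion history with scene-graph context. The following is a minimal sketch of that kind of fusion in PyTorch; the module names, dimensions, and the use of a plain multi-head cross-attention layer in place of the full Perceiver and heterogeneous graph convolution stack are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalTrajectoryPredictor(nn.Module):
    """Illustrative Perceiver-style fusion of motion history and scene context.

    NOTE: hypothetical sketch, not the DiVR code. The real model encodes the
    scene with a heterogeneous graph convolution network; here the context is
    simply a set of pre-computed node embeddings.
    """

    def __init__(self, d_model=128, n_latents=32, n_heads=4, horizon=10):
        super().__init__()
        self.horizon = horizon
        # Learned latent array that cross-attends to all context tokens (Perceiver-style).
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.traj_proj = nn.Linear(3, d_model)       # past 3D positions -> tokens
        self.ctx_proj = nn.Linear(d_model, d_model)  # scene-graph node embeddings -> tokens
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(d_model, horizon * 3)  # predict future 3D positions

    def forward(self, past_traj, scene_nodes):
        # past_traj: (B, T, 3), scene_nodes: (B, N, d_model)
        B = past_traj.shape[0]
        tokens = torch.cat([self.traj_proj(past_traj), self.ctx_proj(scene_nodes)], dim=1)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.cross_attn(latents, tokens, tokens)  # latents attend to both modalities
        fused = self.self_attn(fused)
        pooled = fused.mean(dim=1)
        return self.head(pooled).view(B, self.horizon, 3)

# Example: 8 users, 20 past steps, 50 scene-graph nodes.
model = CrossModalTrajectoryPredictor()
pred = model(torch.randn(8, 20, 3), torch.randn(8, 50, 128))
print(pred.shape)  # torch.Size([8, 10, 3])
```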
Abstract: In film gender studies, the concept of the 'male gaze' refers to the way characters are portrayed on-screen as objects of desire rather than as subjects. In this article, we introduce a novel video-interpretation task: detecting character objectification in films. The purpose is to reveal and quantify the use of complex temporal patterns employed in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset, made of 1,914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models, demonstrate the feasibility of the task, and show where challenges remain with concept bottleneck models. Our new dataset and code are made available to the community.
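A concept bottleneck model, as evaluated in the abstract above, predicts an interpretable vector of concept scores first and derives the final label from those scores alone. Below is a minimal sketch of that structure; the feature dimension, the number of concepts, and the joint training loss are assumptions for illustration, not the ObyGaze12 reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckClassifier(nn.Module):
    """Illustrative concept bottleneck model: clip features -> interpretable
    concept scores -> objectification label. Hypothetical sketch only."""

    def __init__(self, feat_dim=768, n_concepts=12, n_classes=2):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, n_concepts)  # one logit per annotated concept
        self.label_head = nn.Linear(n_concepts, n_classes)   # label predicted from concepts only

    def forward(self, clip_features):
        concept_logits = self.concept_head(clip_features)
        concepts = torch.sigmoid(concept_logits)              # interpretable bottleneck
        return concept_logits, self.label_head(concepts)

# Joint training: supervise both the concept layer (expert annotations) and the label.
model = ConceptBottleneckClassifier()
feats = torch.randn(4, 768)                                   # e.g. pooled video-backbone features
concept_logits, label_logits = model(feats)
concept_targets = torch.randint(0, 2, (4, 12)).float()
label_targets = torch.randint(0, 2, (4,))
loss = (F.binary_cross_entropy_with_logits(concept_logits, concept_targets)
        + F.cross_entropy(label_logits, label_targets))
loss.backward()
```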
Abstract: Head motion prediction is an important problem for 360° videos, in particular to inform streaming decisions. Various methods tackling this problem with deep neural networks have been proposed recently. In this article, we first show the startling result that all such existing methods, which attempt to benefit both from the history of past positions and from knowledge of the video content, perform worse than a simple no-motion baseline. We then propose an LSTM-based architecture that processes the positional information only. It establishes state-of-the-art performance, and we consider it our position-only baseline. Through a thorough root-cause analysis, we first show that the content can indeed inform head position prediction for horizons longer than 2 to 3 s, trajectory inertia being predominant earlier. We also identify that a sequence-to-sequence auto-regressive framework is crucial to improve prediction accuracy over longer prediction windows, and that a dedicated recurrent network handling the time series of positions is necessary to reach the performance of the position-only baseline in the early prediction steps. This allows making the most of the positional information and of the ground-truth saliency. Finally, we show how the level of noise in the estimated saliency impacts the architecture's performance, and we propose a new architecture that establishes state-of-the-art performance with estimated saliency, supporting its design with an ablation study.
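The position-only baseline described above is a sequence-to-sequence, auto-regressive recurrent model over past head positions. The sketch below illustrates that structure; the hidden size, the single-layer encoder/decoder, and the residual displacement update are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PositionOnlySeq2Seq(nn.Module):
    """Illustrative position-only baseline: an encoder LSTM summarizes past head
    positions and an auto-regressive decoder LSTM rolls out future positions.
    Hypothetical sketch, not the paper's architecture."""

    def __init__(self, pos_dim=3, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(pos_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(pos_dim, hidden)
        self.out = nn.Linear(hidden, pos_dim)    # predicts a displacement per step

    def forward(self, past_positions, horizon):
        # past_positions: (B, T, 3), e.g. unit vectors on the sphere
        _, (h, c) = self.encoder(past_positions)
        h, c = h[-1], c[-1]
        pos = past_positions[:, -1]              # start from the last observed position
        preds = []
        for _ in range(horizon):                 # auto-regressive roll-out
            h, c = self.decoder(pos, (h, c))
            pos = pos + self.out(h)              # residual update keeps inertia as the default
            preds.append(pos)
        return torch.stack(preds, dim=1)         # (B, horizon, 3)

# Example: predict 25 future steps from 30 observed positions for 2 users.
model = PositionOnlySeq2Seq()
future = model(torch.randn(2, 30, 3), horizon=25)
print(future.shape)  # torch.Size([2, 25, 3])
```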