M-PSI
Abstract:Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. The approach is designed to disentangle these components in order to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity, respectively. During synthesis, given these embeddings, the decoder of the proposed architecture is conditioned on both speaker and prosody information, allowing it to capture more nuanced emotional states and to adjust the speaker identity precisely. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.
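A minimal sketch of how such a three-branch disentanglement could be wired, assuming mel-spectrogram inputs; the class names (`Anonymizer`, `VectorQuantizer`) and all dimensions are illustrative, not the authors' implementation. Anonymization then amounts to feeding a different speaker's utterance to the speaker branch.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (batch, time, dim)
        dist = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        quantized = self.codebook(dist.argmin(-1))          # nearest-code lookup
        return z + (quantized - z).detach()                 # straight-through estimator

class Anonymizer(nn.Module):
    def __init__(self, n_mels=80, dim=64, pro_dim=32, spk_dim=128):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, dim, batch_first=True)
        self.vq = VectorQuantizer(dim=dim)
        self.prosody_enc = nn.GRU(n_mels, pro_dim, batch_first=True)
        self.speaker_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.decoder = nn.GRU(dim + pro_dim + spk_dim, n_mels, batch_first=True)

    def forward(self, mel, speaker_mel):
        content = self.vq(self.content_enc(mel)[0])         # quantized content branch
        prosody = self.prosody_enc(mel)[0]                   # prosody branch
        spk = self.speaker_enc(speaker_mel)[1][-1]           # speaker branch: (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, mel.size(1), -1)   # broadcast over time
        return self.decoder(torch.cat([content, prosody, spk], dim=-1))[0]
```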
Abstract:This paper presents two models to address the problem of multi-person activity recognition using ambient sensors in a home. The first model, Seq2Res, uses a sequence generation approach to separate sensor events from different residents. The second model, BiGRU+Q2L, uses a Query2Label multi-label classifier to predict multiple activities simultaneously. The performance of these models is compared to that of a state-of-the-art model in different experimental scenarios, using a state-of-the-art dataset of two residents in a home instrumented with ambient sensors. These results lead to a discussion on the advantages and drawbacks of resident separation and multi-label classification for multi-person activity recognition.
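A minimal sketch of a Query2Label-style multi-label head on top of a BiGRU encoder of sensor events; the class name `BiGRUQuery2Label` and all dimensions are hypothetical, not the paper's exact configuration. Each learnable label query attends over the encoded event sequence and yields one activity logit.

```python
import torch
import torch.nn as nn

class BiGRUQuery2Label(nn.Module):
    def __init__(self, num_sensors=50, num_activities=10, dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_sensors, dim)
        self.encoder = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.label_queries = nn.Parameter(torch.randn(num_activities, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, events):                               # events: (batch, time) sensor ids
        memory, _ = self.encoder(self.embed(events))         # (batch, time, dim)
        queries = self.label_queries.expand(events.size(0), -1, -1)
        decoded = self.decoder(queries, memory)              # (batch, labels, dim)
        return self.classifier(decoded).squeeze(-1)          # one logit per activity

# An activity is predicted active whenever its sigmoid score exceeds 0.5.
logits = BiGRUQuery2Label()(torch.randint(0, 50, (4, 120)))
active = torch.sigmoid(logits) > 0.5
```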
Abstract:This paper explores privacy-compliant group-level emotion recognition ''in-the-wild'' within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This work restricts itself to global features and avoids individual ones, i.e., all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between the modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with 79.24% and 75.13% accuracy on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly sampled from the video.
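A minimal sketch of cross-attention fusion between the two branches, assuming the ViT and audio front-ends have already produced token embeddings of a common dimension; all names and sizes here are hypothetical, not the challenge submission itself.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_classes=3, nhead=4):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, video_tokens, audio_tokens):
        # Each modality queries the other, then pooled features are concatenated.
        v, _ = self.video_to_audio(video_tokens, audio_tokens, audio_tokens)
        a, _ = self.audio_to_video(audio_tokens, video_tokens, video_tokens)
        fused = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.head(fused)                              # group-level emotion logits

# Example: 5 uniformly sampled frames (as ViT tokens) and 100 audio tokens.
logits = CrossModalFusion()(torch.randn(2, 5, 256), torch.randn(2, 100, 256))
```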
Abstract:Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.
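A minimal sketch of how a multi-scale synchrony loss could aggregate the syncer models' judgments, assuming each pretrained syncer maps aligned audio and motion windows at one temporal scale to a pair of embeddings; the function and its interface are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multiscale_sync_loss(syncers, audio_pyramid, motion_pyramid):
    """Average cosine mismatch between audio and motion embeddings across scales."""
    loss = 0.0
    for syncer, audio, motion in zip(syncers, audio_pyramid, motion_pyramid):
        a_emb, m_emb = syncer(audio, motion)                 # (batch, dim) each
        loss = loss + (1.0 - F.cosine_similarity(a_emb, m_emb, dim=-1)).mean()
    return loss / len(syncers)
```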
Abstract:Transition angles have been defined to describe vowel-to-vowel transitions in the acoustic space of Spectral Subband Centroids, and findings show that they are similar across speakers and speaking rates. In this paper, we propose to investigate the use of polar coordinates, rather than angles alone, to describe a speech signal by characterizing its acoustic trajectory, and to use them in automatic speech recognition. According to experimental results on the BRAF100 dataset, the polar coordinates achieved significantly higher accuracy than the angles in mixed- and cross-gender speech recognition, demonstrating that these representations are better at describing the acoustic trajectory of the speech signal. Furthermore, accuracy improved significantly when they were used with their first- and second-order derivatives ($\Delta$, $\Delta\Delta$), especially in cross-female recognition. However, the results showed they were not much more gender-independent than conventional Mel-frequency cepstral coefficients (MFCCs).
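A minimal sketch of deriving polar-coordinate trajectory features with their derivatives from an SSC trajectory, under the simplifying assumption that the trajectory is characterized in the first two subbands; the function and framing choices are illustrative, not the paper's exact feature extraction.

```python
import numpy as np

def polar_trajectory_features(ssc):
    """ssc: (frames, subbands) SSC trajectory -> polar features with deltas."""
    step = np.diff(ssc[:, :2], axis=0)                  # displacement in two subbands
    radius = np.linalg.norm(step, axis=1, keepdims=True)
    angle = np.arctan2(step[:, 1:2], step[:, 0:1])      # transition angle
    polar = np.hstack([radius, angle])
    delta = np.gradient(polar, axis=0)                  # first-order derivative
    delta2 = np.gradient(delta, axis=0)                 # second-order derivative
    return np.hstack([polar, delta, delta2])            # (frames - 1, 6) features
```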
Abstract:We address the task of unconditional head motion generation to animate still human faces in a low-dimensional semantic space. Deviating from talking head generation conditioned on audio, which seldom puts emphasis on realistic head motions, we devise a GAN-based architecture that allows obtaining rich head motion sequences while avoiding known caveats associated with GANs. Namely, the autoregressive generation of incremental outputs ensures smooth trajectories, while a multi-scale discriminator on input pairs drives generation toward better handling of high- and low-frequency signals and less mode collapse. We demonstrate experimentally the relevance of the proposed architecture and compare it with models that showed state-of-the-art performance on similar tasks.
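A minimal sketch of the autoregressive, incremental generation idea, assuming a low-dimensional pose vector per frame; the class name, noise injection scheme, and sizes are hypothetical, and the multi-scale discriminator is omitted.

```python
import torch
import torch.nn as nn

class IncrementalMotionGenerator(nn.Module):
    def __init__(self, pose_dim=6, noise_dim=32, hidden=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.rnn = nn.GRU(noise_dim + pose_dim, hidden, batch_first=True)
        self.to_delta = nn.Linear(hidden, pose_dim)

    def forward(self, first_pose, steps=100):            # first_pose: (batch, pose_dim)
        pose, poses, h = first_pose, [first_pose], None
        for _ in range(steps):
            z = torch.randn(pose.size(0), 1, self.noise_dim, device=pose.device)
            out, h = self.rnn(torch.cat([z, pose.unsqueeze(1)], dim=-1), h)
            pose = pose + self.to_delta(out[:, -1])       # incremental pose update
            poses.append(pose)
        return torch.stack(poses, dim=1)                  # smooth (batch, steps + 1, pose_dim) trajectory
```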
Abstract:Transformers trained with self-supervised learning using a self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object in an image. Visual tokens are viewed as nodes in a weighted graph, with edges representing a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph-cut to group self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the second smallest eigenvector provides a cutting solution, since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state-of-the-art LOST by margins of 6.9%, 8.1%, and 8.1% on VOC07, VOC12, and COCO20K, respectively. The performance can be further improved by adding a second-stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU by 4.9%, 5.2%, and 12.9% on ECSSD, DUTS, and DUT-OMRON, respectively, compared to the previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.
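A minimal sketch of the normalized-cut step on patch-token features, assuming the DINO features are extracted elsewhere and passed in; the similarity threshold and the bipartition rule are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_foreground_mask(tokens, tau=0.2):
    """tokens: (num_patches, dim) ViT features -> boolean foreground mask per patch."""
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    W = feats @ feats.T                                  # cosine-similarity graph
    W = np.where(W > tau, 1.0, 1e-5)                     # binarized edge weights
    D = np.diag(W.sum(axis=1))
    # Generalized eigenproblem (D - W) x = lambda D x; take the second smallest eigenvector.
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    second = vecs[:, 0]
    # Its absolute value is read as a foreground likelihood, giving a bipartition.
    return np.abs(second) > np.abs(second).mean()
```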
Abstract:Urban autonomous driving in the presence of pedestrians as vulnerable road users is still a challenging and less examined research problem. This work formulates navigation in urban environments as a multi-objective reinforcement learning problem. A deep learning variant of thresholded lexicographic Q-learning is presented for autonomous navigation amongst pedestrians. The multi-objective DQN agent is trained on a custom urban environment developed in the CARLA simulator. The proposed method is evaluated by comparing it with a single-objective DQN variant on known and unknown environments. Evaluation results show that the proposed method outperforms the single-objective DQN variant in all evaluated aspects.
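A minimal sketch of thresholded lexicographic action selection over several Q-heads, with hypothetical objective ordering and slack values; the DQN networks themselves are abstracted as arrays of per-action Q-values.

```python
import numpy as np

def lexicographic_action(q_values_per_objective, slacks):
    """Filter actions objective by objective, keeping only near-optimal ones."""
    candidates = np.arange(q_values_per_objective[0].shape[0])
    for q, slack in zip(q_values_per_objective, slacks):
        best = q[candidates].max()
        candidates = candidates[q[candidates] >= best - slack]  # thresholded keep
    return candidates[0]                     # any surviving action satisfies all objectives

# Example ordering: safety first (tight slack), then progress (looser slack).
q_safety, q_progress = np.array([0.9, 0.1, 0.8]), np.array([0.2, 0.9, 0.6])
action = lexicographic_action([q_safety, q_progress], slacks=[0.05, 0.1])
```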
Abstract:Prediction of human actions in social interactions has important applications in the design of social robots or artificial avatars. In this paper, we model human interaction generation as a discrete multi-sequence generation problem and present SocialInteractionGAN, a novel adversarial architecture for conditional interaction generation. Our model builds on a recurrent encoder-decoder generator network and a dual-stream discriminator. This architecture allows the discriminator to jointly assess the realism of interactions and that of individual action sequences. Within each stream, a recurrent network operating on short subsequences endows the output signal with local assessments, better guiding the forthcoming generation. Crucially, contextual information on interacting participants is shared among agents and reinjected into both the generation and the discriminator evaluation processes. We show that the proposed SocialInteractionGAN succeeds in producing highly realistic action sequences of interacting people, comparing favorably to a variety of recurrent and convolutional discriminator baselines. Evaluations are conducted using modified Inception Score and Fréchet Inception Distance metrics that we specifically design for discrete sequential generated data. The distribution of generated sequences is shown to closely approach that of real data. In particular, our model properly learns the dynamics of interaction sequences, while exploiting the full range of actions.
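A minimal sketch of a dual-stream discriminator with local, per-step assessments, assuming discrete action ids per participant and per time step; all names and dimensions are hypothetical, and the subsequence windowing and context reinjection of the paper are simplified away.

```python
import torch
import torch.nn as nn

class DualStreamDiscriminator(nn.Module):
    def __init__(self, num_actions=20, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, dim)
        self.individual = nn.GRU(dim, dim, batch_first=True)
        self.interaction = nn.GRU(dim, dim, batch_first=True)
        self.ind_score = nn.Linear(dim, 1)
        self.int_score = nn.Linear(dim, 1)

    def forward(self, actions):                 # actions: (batch, participants, time) ids
        x = self.embed(actions)                              # (B, P, T, dim)
        b, p, t, d = x.shape
        ind, _ = self.individual(x.reshape(b * p, t, d))     # per-participant stream
        inter, _ = self.interaction(x.mean(dim=1))           # pooled interaction stream
        # Local (per-step) realism scores for each stream.
        return self.ind_score(ind).reshape(b, p, t), self.int_score(inter).squeeze(-1)
```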
Abstract:Decision making for autonomous driving in urban environments is challenging due to the complexity of the road structure and the uncertainty in the behavior of diverse road users. Traditional methods rely on manually designed rules as the driving policy, which require expert domain knowledge, are difficult to generalize, and may give sub-optimal results as the environment gets complex. In contrast, with reinforcement learning, an optimal driving policy can be learned and improved automatically through repeated interactions with the environment. However, current research in reinforcement learning for autonomous driving is mainly focused on highway setups, with little to no emphasis on urban environments. In this work, a deep reinforcement learning based decision-making approach for high-level driving behavior is proposed for urban environments in the presence of pedestrians. For this, the use of a Deep Recurrent Q-Network (DRQN) is explored, a method combining the state-of-the-art Deep Q-Network (DQN) with a long short-term memory (LSTM) layer that helps the agent gain a memory of the environment. A 3-D state representation is designed as the input, combined with a well-defined reward function, to train the agent to learn an appropriate behavior policy in a realistic urban simulator. The proposed method is evaluated on dense urban scenarios and compared with a rule-based approach, and results show that the proposed DRQN-based driving behavior decision maker outperforms the rule-based approach.
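A minimal sketch of the DQN-plus-LSTM idea behind a DRQN, assuming a 3-channel 40x40 grid-like state per time step; the architecture sizes and action count are hypothetical, not the paper's network.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, in_channels=3, num_actions=5, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(), nn.Flatten(),
        )
        self.lstm = nn.LSTM(64 * 8 * 8, hidden, batch_first=True)   # for 40x40 inputs
        self.q_head = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):       # frames: (batch, time, C, 40, 40)
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1)).reshape(b, t, -1)
        out, state = self.lstm(feats, state)     # recurrent memory over past observations
        return self.q_head(out), state           # Q-values per step and the LSTM state
```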