Xiaolou Li

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition

Oct 21, 2024

Zehua Liu, Xiaolou Li, Chen Chen, Li Guo, Lantian Li, Dong Wang

Abstract: Visual Speech Recognition (VSR) aims to recognize corresponding text by analyzing visual information from lip movements. Due to the high variability and weak information of lip movements, VSR tasks require effectively utilizing any information from any source and at any level. In this paper, we propose a VSR method based on audio-visual cross-modal alignment, named AlignVSR. The method leverages the audio modality as an auxiliary information source and utilizes the global and local correspondence between the audio and visual modalities to improve visual-to-text inference. Specifically, the method first captures global alignment between video and audio through a cross-modal attention mechanism from video frames to a bank of audio units. Then, based on the temporal correspondence between audio and video, a frame-level local alignment loss is introduced to refine the global alignment, improving the utility of the audio information. Experimental results on the LRS2 and CNVSRC.Single datasets consistently show that AlignVSR outperforms several mainstream VSR methods, demonstrating its superior and robust performance.
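
The two alignment steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scaled dot-product scoring, and the use of cross-entropy as the frame-level loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_to_audio_units(video_frames, audio_units):
    """Global alignment (assumed form): each video frame attends over the audio unit bank.

    Returns (weights, contexts): one attention distribution and one
    attended audio vector per video frame.
    """
    dim = len(audio_units[0])
    scale = math.sqrt(dim)
    weights, contexts = [], []
    for frame in video_frames:
        w = softmax([dot(frame, unit) / scale for unit in audio_units])
        ctx = [sum(w[k] * audio_units[k][d] for k in range(len(audio_units)))
               for d in range(dim)]
        weights.append(w)
        contexts.append(ctx)
    return weights, contexts


def local_alignment_loss(weights, target_units):
    """Local alignment (assumed form): cross-entropy between each frame's
    attention weights and the index of the audio unit aligned with that frame."""
    total = -sum(math.log(w[t]) for w, t in zip(weights, target_units))
    return total / len(weights)


# Toy data: two audio units and two video frames (hypothetical values).
units = [[1.0, 0.0], [0.0, 1.0]]
frames = [[2.0, 0.0], [0.0, 2.0]]
weights, contexts = attend_to_audio_units(frames, units)
loss = local_alignment_loss(weights, [0, 1])
```

In this toy setup the loss is lower when each frame's target is the unit it already attends to most, which is the sense in which a frame-level loss could refine the global attention.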

Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective

Sep 29, 2024

Chen Chen, Xiaolou Li, Zehua Liu, Lantian Li, Dong Wang

Abstract: In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
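
As a rough illustration of what "information intersection" between two modalities could mean, the sketch below computes the mutual information of two discrete variables from a joint probability table. Mutual information is used here as a standard stand-in; the abstract does not say which quantity the paper actually measures, and the tables are hypothetical toy values.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # zero-probability cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Hypothetical toy cases: rows index an audio symbol, columns a visual symbol.
independent = [[0.25, 0.25], [0.25, 0.25]]      # modalities share no information
identical = [[0.5, 0.0], [0.0, 0.5]]            # one modality fully determines the other
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

Real audio-visual data would fall between these two extremes, which is where such a measure could help quantify how much one modality can contribute to a task defined on the other.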

* Accepted by ISCSLP 2024

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge

Jun 14, 2024

Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang

Abstract: The first Chinese Continuous Visual Speech Recognition Challenge aimed to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR) on two tasks: (1) Single-speaker VSR for a particular speaker and (2) Multi-speaker VSR for a set of registered speakers. The challenge yielded highly successful results, with the best submission significantly outperforming the baseline, particularly in the single-speaker task. This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction. It also summarises the representative techniques employed by the submitted systems, highlighting the most effective approaches. Additional information and resources about this challenge can be accessed through the official website at http://cnceleb.org/competition.

* Accepted by INTERSPEECH 2024

Zero-Shot Fake Video Detection by Audio-Visual Consistency

Jun 12, 2024

Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang

Abstract: Recent studies have advocated the detection of fake videos as a one-class detection task, predicated on the hypothesis that the consistency between audio and visual modalities of genuine data is more significant than that of fake data. This methodology, which relies solely on genuine audio-visual data and removes the need for forged counterparts, is thus described as a 'zero-shot' detection paradigm. This paper introduces a novel zero-shot detection approach anchored in content consistency across audio and video. By employing pre-trained ASR and VSR models, we recognize the audio and video content sequences, respectively. Then, the edit distance between the two sequences is computed to assess whether the claimed video is genuine. Experimental results indicate that, compared to two mainstream approaches based on semantic consistency and temporal consistency, our approach achieves superior generalizability across various deepfake techniques and demonstrates strong robustness against audio-visual perturbations. Finally, state-of-the-art performance gains can be achieved by simply integrating the decision scores of these three systems.
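
The content-consistency check described in the abstract can be sketched as follows. The Levenshtein distance is a standard edit distance; the length normalization, the function names, and the 0.5 threshold are assumptions for illustration only, and the token sequences stand in for the outputs of pre-trained ASR and VSR models, which are not run here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def inconsistency_score(asr_tokens, vsr_tokens):
    """Edit distance normalized by the longer sequence (assumed normalization)."""
    longest = max(len(asr_tokens), len(vsr_tokens), 1)
    return edit_distance(asr_tokens, vsr_tokens) / longest


def looks_fake(asr_tokens, vsr_tokens, threshold=0.5):
    """Flag a video when audio and visual transcripts disagree too much.

    The threshold is a hypothetical value; in practice it would be tuned
    on genuine data only, in keeping with the one-class setting.
    """
    return inconsistency_score(asr_tokens, vsr_tokens) > threshold


# Hypothetical transcripts standing in for ASR and VSR outputs.
genuine = looks_fake(["the", "cat", "sat"], ["the", "cat", "sat"])
forged = looks_fake(["the", "cat", "sat"], ["buy", "this", "now"])
```

Because VSR output is typically noisier than ASR output, a nonzero distance is expected even for genuine videos, which is why the decision rests on a threshold rather than an exact match.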

* to be published in INTERSPEECH 2024

CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition

May 25, 2023

Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang

Abstract: Audio-visual person recognition (AVPR) has received extensive attention. However, most datasets used for AVPR research so far are collected in constrained environments, and thus cannot reflect the true performance of AVPR systems in real-world scenarios. To meet the demand for research on AVPR in unconstrained conditions, this paper presents a multi-genre AVPR dataset collected 'in the wild', named CN-Celeb-AV. This dataset contains more than 420k video segments from 1,136 persons from public media. In particular, we put more emphasis on two real-world complexities: (1) data in multiple genres; (2) segments with partial information. A comprehensive study was conducted to compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the results demonstrated that CN-Celeb-AV is more in line with real-world scenarios and can be regarded as a new benchmark dataset for AVPR research. The dataset also includes a development set that can be used to boost the performance of AVPR systems in real-life situations. The dataset is free for researchers and can be downloaded from http://cnceleb.org/.

* to be published in INTERSPEECH 2023
