Lantian Li

An Investigation on Speaker Augmentation for End-to-End Speaker Extraction

May 27, 2025

Zhenghai You, Zhenyu Zhou, Lantian Li, Dong Wang

Abstract: Target confusion, defined as occasional switching to non-target speakers, poses a key challenge for end-to-end speaker extraction (E2E-SE) systems. We argue that this problem is largely caused by the lack of generalizability and discrimination of the speaker embeddings, and introduce a simple yet effective speaker augmentation strategy to tackle the problem. Specifically, we propose a time-domain resampling and rescaling pipeline that alters speaker traits while preserving other speech properties. This generates a variety of pseudo-speakers to help establish a generalizable speaker embedding space, while the speaker-trait-specific augmentation creates hard samples that force the model to focus on genuine speaker characteristics. Experiments on WSJ0-2Mix and LibriMix show that our method mitigates target confusion and improves extraction performance. Moreover, it can be combined with metric learning, another effective approach to address target confusion, leading to further gains.

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

Oct 21, 2024

Wan Lin, Junhui Chen, Tianhao Wang, Zhenyu Zhou, Lantian Li, Dong Wang

Abstract: Current mainstream speaker verification systems are predominantly based on the concept of "speaker embedding", which transforms variable-length speech signals into fixed-length speaker vectors, followed by verification based on cosine similarity between the embeddings of the enrollment and test utterances. However, this approach suffers from considerable performance degradation in the presence of severe noise and interfering speakers. This paper introduces Neural Scoring, a novel framework that reframes speaker verification as a scoring task using a Transformer-based architecture. The proposed method first extracts an embedding from the enrollment speech and frame-level features from the test speech. A Transformer network then generates a decision score that quantifies the likelihood of the enrolled speaker being present in the test speech. We evaluated Neural Scoring on the VoxCeleb dataset across five test scenarios, comparing it with the state-of-the-art embedding-based approach. While Neural Scoring achieves comparable performance to the state-of-the-art under the benchmark (clean) test condition, it demonstrates a remarkable advantage in the four complex scenarios, achieving an overall 64.53% reduction in equal error rate (EER) compared to the baseline.
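
The embedding-based baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of cosine-similarity scoring only, not the paper's Neural Scoring model; the function names and the `threshold` value are hypothetical, and the embeddings are assumed to come from some speaker encoder that is not shown.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two fixed-length speaker vectors."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    if norm == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / norm


def verify(enroll, test, threshold=0.5):
    """Accept the trial if the score reaches the threshold.

    The default threshold is an illustrative placeholder; in practice it
    would be tuned on held-out trials.
    """
    return cosine_score(enroll, test) >= threshold
```

Because the score depends only on two pooled vectors, any interference mixed into the test utterance is already baked into its embedding before scoring, which is consistent with the degradation the abstract describes.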

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition

Oct 21, 2024

Zehua Liu, Xiaolou Li, Chen Chen, Li Guo, Lantian Li, Dong Wang

Abstract: Visual Speech Recognition (VSR) aims to recognize corresponding text by analyzing visual information from lip movements. Due to the high variability and weak information of lip movements, VSR tasks require effectively utilizing any information from any source and at any level. In this paper, we propose a VSR method based on audio-visual cross-modal alignment, named AlignVSR. The method leverages the audio modality as an auxiliary information source and utilizes the global and local correspondence between the audio and visual modalities to improve visual-to-text inference. Specifically, the method first captures global alignment between video and audio through a cross-modal attention mechanism from video frames to a bank of audio units. Then, based on the temporal correspondence between audio and video, a frame-level local alignment loss is introduced to refine the global alignment, improving the utility of the audio information. Experimental results on the LRS2 and CNVSRC.Single datasets consistently show that AlignVSR outperforms several mainstream VSR methods, demonstrating its superior and robust performance.

Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective

Sep 29, 2024

Chen Chen, Xiaolou Li, Zehua Liu, Lantian Li, Dong Wang

Abstract: In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.

* Accepted by ISCSLP 2024

Serialized Output Training by Learned Dominance

Jul 04, 2024

Ying Shi, Lantian Li, Shi Yin, Dong Wang, Jiqing Han

Abstract: Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.

* accepted by INTERSPEECH 2024

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge

Jun 14, 2024

Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang

Abstract: The first Chinese Continuous Visual Speech Recognition Challenge aimed to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR) on two tasks: (1) Single-speaker VSR for a particular speaker and (2) Multi-speaker VSR for a set of registered speakers. The challenge yielded highly successful results, with the best submission significantly outperforming the baseline, particularly in the single-speaker task. This paper comprehensively reviews the challenge, encompassing the data profile, task specifications, and baseline system construction. It also summarises the representative techniques employed by the submitted systems, highlighting the most effective approaches. Additional information and resources about this challenge can be accessed through the official website at http://cnceleb.org/competition.

* Accepted by INTERSPEECH 2024

SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition

Jun 12, 2024

Tianhao Wang, Lantian Li, Dong Wang

Abstract: Deploying a well-optimized pre-trained speaker recognition model in a new domain often leads to a significant decline in performance. While fine-tuning is a commonly employed solution, it demands ample adaptation data and suffers from parameter inefficiency, rendering it impractical for real-world applications with limited data available for model adaptation. Drawing inspiration from the success of adapters in self-supervised pre-trained models, this paper introduces an SE/BN adapter to address this challenge. By freezing the core speaker encoder and adjusting the feature maps' weights and activation distributions, we introduce a novel adapter utilizing trainable squeeze-and-excitation (SE) blocks and batch normalization (BN) layers, termed the SE/BN adapter. Our experiments, conducted using VoxCeleb for pre-training and 4 genres from CN-Celeb for adaptation, demonstrate that the SE/BN adapter offers significant performance improvement over the baseline and competes with the vanilla fine-tuning approach by tuning just 1% of the parameters.

* to be published in INTERSPEECH 2024

Zero-Shot Fake Video Detection by Audio-Visual Consistency

Jun 12, 2024

Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang

Abstract: Recent studies have advocated the detection of fake videos as a one-class detection task, predicated on the hypothesis that the consistency between audio and visual modalities of genuine data is more significant than that of fake data. This methodology, which solely relies on genuine audio-visual data while negating the need for forged counterparts, is thus delineated as a 'zero-shot' detection paradigm. This paper introduces a novel zero-shot detection approach anchored in content consistency across audio and video. By employing pre-trained ASR and VSR models, we recognize the audio and video content sequences, respectively. Then, the edit distance between the two sequences is computed to assess whether the claimed video is genuine. Experimental results indicate that, compared to two mainstream approaches based on semantic consistency and temporal consistency, our approach achieves superior generalizability across various deepfake techniques and demonstrates strong robustness against audio-visual perturbations. Finally, state-of-the-art performance gains can be achieved by simply integrating the decision scores of these three systems.

* to be published in INTERSPEECH 2024
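
The edit-distance step mentioned in the abstract can be sketched as below. This is a hedged illustration under stated assumptions: the two inputs stand in for the token sequences produced by the ASR and VSR models (which are not shown), the token unit is left open, and the length normalization in `consistency_score` is a hypothetical choice rather than something the abstract specifies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def consistency_score(asr_tokens, vsr_tokens):
    """Normalized distance in [0, 1]; lower means the two streams agree more.

    Dividing by the longer sequence length is an assumption made here for
    illustration only.
    """
    longest = max(len(asr_tokens), len(vsr_tokens))
    if longest == 0:
        return 0.0
    return edit_distance(asr_tokens, vsr_tokens) / longest
```

A video would then be flagged as suspicious when the score exceeds some threshold chosen on genuine data, which fits the one-class setting the abstract describes.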

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition

Jun 11, 2024

Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang

Abstract: Data augmentation (DA) has played a pivotal role in the success of deep speaker recognition. Current DA techniques primarily focus on speaker-preserving augmentation, which does not change the speaker trait of the speech and does not create new speakers. Recent research has shed light on the potential of speaker augmentation, which generates new speakers to enrich the training dataset. In this study, we delve into two speaker augmentation approaches: speed perturbation (SP) and vocal tract length perturbation (VTLP). Despite the empirical utilization of both methods, a comprehensive investigation into their efficacy is lacking. Our study, conducted using two public datasets, VoxCeleb and CN-Celeb, revealed that both SP and VTLP are proficient at generating new speakers, leading to significant performance improvements in speaker recognition. Furthermore, they exhibit distinct properties in sensitivity to perturbation factors and data complexity, hinting at the potential benefits of their fusion. Our research underscores the substantial potential of speaker augmentation, highlighting the importance of in-depth exploration and analysis.

* to be published in INTERSPEECH 2024
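
Speed perturbation, one of the two approaches named in the abstract, amounts to resampling the waveform and playing it back at the original rate. The sketch below uses plain linear interpolation as a stand-in; the paper's actual resampler and perturbation factors are not stated here, so `speed_perturb` and the example factor are hypothetical. VTLP is not shown.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so it plays `factor` times faster at the same rate.

    factor > 1 shortens the signal and raises its pitch; factor < 1 does the
    opposite. Linear interpolation is used for simplicity (an assumption; a
    production resampler would apply proper anti-aliasing filtering).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(n_in / factor)
    out = []
    for k in range(n_out):
        pos = k * factor                  # fractional read position in the input
        i = min(int(pos), n_in - 1)       # left neighbour, clamped to the last sample
        frac = pos - i
        nxt = samples[min(i + 1, n_in - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out
```

In speaker augmentation, each perturbed copy of an utterance would be given a new speaker label (for example, the original label combined with the factor), so the perturbed data enlarges the set of training speakers rather than adding more samples of existing ones.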

How phonemes contribute to deep speaker models?

Feb 05, 2024

Pengqi Li, Tianhao Wang, Lantian Li, Askar Hamdulla, Dong Wang

Abstract: Which phonemes convey more speaker traits is a long-standing question, and various perception experiments have been conducted with human subjects. For speaker recognition, studies were conducted with conventional statistical models, and the conclusions drawn are more or less consistent with the perception results. However, which phonemes are more important with modern deep neural models is still unexplored, due to the opaqueness of the decision process. This paper conducts a novel study for the attribution of phonemes with two types of deep speaker models that are based on TDNN and CNN respectively, from the perspective of model explanation. Specifically, we conducted the study using two post-explanation methods: LayerCAM and Time Align Occlusion (TAO). Experimental results showed that: (1) At the population level, vowels are more important than consonants, confirming the human perception studies. However, fricatives are among the most unimportant phonemes, which contrasts with previous studies. (2) At the speaker level, a large between-speaker variation is observed regarding phoneme importance, indicating that whether a phoneme is important or not is largely speaker-dependent.
