Junyi Ao

Audio Deepfake Verification

Sep 10, 2025

Li Wang, Junyi Ao, Linyong Gan, Yuancheng Wang, Xueyao Zhang, Zhizheng Wu

Abstract: With the rapid development of deepfake technology, a simple binary real-or-fake judgment on audio is no longer sufficient to meet practical needs; accurately determining the specific deepfake method has become crucial. This paper introduces the Audio Deepfake Verification (ADV) task, which addresses the limitations of existing deepfake source tracing methods in closed-set scenarios and aims to achieve open-set deepfake source tracing. The paper also proposes Audity, a dual-branch architecture that extracts deepfake features from two dimensions: audio structure and generation artifacts. Experimental results show that the dual-branch Audity architecture outperforms any single-branch configuration and achieves excellent performance in both deepfake detection and verification tasks.

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

Mar 19, 2025

Junyi Ao, Dekun Chen, Xiaohai Tian, Wenjie Feng, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

Abstract: Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval contains diverse speech instructions with various speaking styles and covers two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.

Overview of the Amphion Toolkit (v0.2)

Jan 26, 2025

Jiaqi Li, Xueyao Zhang, Yuancheng Wang, Haorui He, Chaoren Wang, Li Wang, Huan Liao, Junyi Ao, Zeyu Xie, Yiqiao Huang(+2 more)

Abstract: Amphion is an open-source toolkit for Audio, Music, and Speech Generation, designed to lower the entry barrier for junior researchers and engineers in these fields. It provides a versatile framework that supports a variety of generation tasks and models. In this report, we introduce Amphion v0.2, the second major release developed in 2024. This release features a 100K-hour open-source multilingual dataset, a robust data preparation pipeline, and novel models for tasks such as text-to-speech, audio coding, and voice conversion. Furthermore, the report includes multiple tutorials that guide users through the functionalities and usage of the newly released models.

* GitHub: https://github.com/open-mmlab/Amphion

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Jul 03, 2024

Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li

Abstract: It has been shown that models pre-trained with self-supervised learning (SSL) techniques are effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, limiting their effectiveness on mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extraction that takes into account the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to enhance robustness to speaker absence. Experiments show that SA-WavLM either matches or improves upon the state-of-the-art pre-trained models.

* InterSpeech 2024

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Jun 19, 2024

Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that of SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation than traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks

Feb 28, 2024

Duo Ma, Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Abstract: Human language can be expressed in either written or spoken form, i.e. text or speech. Humans can acquire knowledge from text to improve speaking and listening. However, the quest for speech pre-trained models to leverage unpaired text has just started. In this paper, we investigate a new way to pre-train such a joint speech-text model to learn enhanced speech representations and benefit various speech-related downstream tasks. Specifically, we propose a novel pre-training method, text-guided HuBERT, or T-HuBERT, which performs self-supervised learning over speech to derive phoneme-like discrete representations. These phoneme-like pseudo-label sequences are first derived from speech via generative adversarial networks (GANs) so that they are statistically similar to those from additional unpaired textual data. In this way, we build a bridge between unpaired speech and text in an unsupervised manner. Extensive experiments demonstrate the significant superiority of our proposed method over various strong baselines, with up to 15.3% relative Word Error Rate (WER) reduction on the LibriSpeech dataset.

* 5 pages, 1 figure, 5 tables, submitted to IEEE Signal Processing Letters (SPL)
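
Several abstracts on this page report results as a relative error-rate reduction (relative WER, CER, or cpCER). As a reading aid, the following is a minimal sketch of how such a figure is conventionally computed from a baseline and a new error rate. The function name and the numbers in the usage lines are purely illustrative and are not taken from any of the papers.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error-rate reduction in percent.

    Conventional definition: (baseline - new) / baseline * 100.
    Both error rates must be expressed in the same unit (e.g. percent WER).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Illustrative numbers only (not results from any paper listed here):
# a baseline WER of 20.0% lowered to 15.0% is a 5.0-point absolute
# reduction, which corresponds to a 25% relative reduction.
example = relative_reduction(20.0, 15.0)
print(f"{example:.1f}% relative reduction")
```

Note that a relative reduction is always larger in percentage terms than the corresponding absolute drop in error-rate points whenever the baseline is below 100%, so the two should not be compared directly.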

The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge

Dec 26, 2023

Meng Ge, Yizhou Peng, Yidi Jiang, Jingru Lin, Junyi Ao, Mehmet Sinan Yildirim, Shuai Wang, Haizhou Li, Mengling Feng

Abstract: This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for the ICMC-ASR Challenge include multi-channel front-end enhancement and diarization, training data augmentation, and speech recognition modeling with multi-channel branches. Tested on the official Eval1 and Eval2 sets, our best system achieves a relative 34.3% improvement in CER and a 56.5% improvement in cpCER compared to the official baseline system.

* Technical Report. 2 pages. For ICMC-ASR-2023 Challenge

USED: Universal Speaker Extraction and Diarization

Sep 19, 2023

Junyi Ao, Mehmet Sinan Yıldırım, Meng Ge, Shuai Wang, Ruijie Tao, Yanmin Qian, Liqun Deng, Longshuai Xiao, Haizhou Li

Abstract: Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying "who spoke when". Previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective, namely to disentangle the speakers, in the spectral domain for the former and in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/.

* Submitted to ICASSP 2024

Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder

Jul 19, 2023

Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li

Abstract: Acoustic word embeddings (AWEs) aim to map a variable-length speech segment into a fixed-dimensional representation. High-quality AWEs should be invariant to variations such as duration, pitch, and speaker. In this paper, we introduce a novel self-supervised method to learn robust AWEs from a large-scale unlabelled speech corpus. Our model, named Correspondence Transformer Encoder (CTE), employs a teacher-student learning framework. We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space. Specifically, we feed the teacher and student encoders with different acoustic instances of the same word and pre-train the model with a word-level loss. Our experiments show that the embeddings extracted from the proposed CTE model are robust to speech variations, e.g. speakers and domains. Additionally, when evaluated on Xitsonga, a low-resource cross-lingual setting, the CTE model achieves new state-of-the-art performance.

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Oct 30, 2022

Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Abstract: Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we are able to perform speech-text joint pre-training on unpaired speech and text. In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech. Firstly, because speech is continuous while text is discrete, we discretize speech into a sequence of discrete speech tokens to solve the modality mismatch problem. Secondly, to solve the length mismatch problem, where the speech sequence is usually much longer than the text sequence, we convert the words of text into phoneme sequences and randomly repeat each phoneme in the sequences. Finally, we feed the discrete speech and text tokens into a modality-agnostic Transformer encoder and pre-train with token-level masking language modeling (tMLM). Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction. The token2vec model is also validated on a non-ASR task, i.e., spoken intent classification, and shows good transferability.

* Submitted to ICASSP 2023
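
The length-matching step described in the token2vec abstract (converting text to phonemes and randomly repeating each phoneme so the text sequence becomes closer in length to a speech token sequence) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the uniform repeat range of 1 to 3, and the example phoneme sequence are all assumptions made for illustration, and the paper may sample repeat counts from a different distribution.

```python
import random
from typing import List, Optional


def upsample_phonemes(
    phonemes: List[str],
    min_repeat: int = 1,   # hypothetical lower bound on repeats
    max_repeat: int = 3,   # hypothetical upper bound on repeats
    seed: Optional[int] = None,
) -> List[str]:
    """Repeat each phoneme a random number of times, preserving order."""
    if not 1 <= min_repeat <= max_repeat:
        raise ValueError("require 1 <= min_repeat <= max_repeat")
    rng = random.Random(seed)
    upsampled: List[str] = []
    for phoneme in phonemes:
        # randint is inclusive on both ends.
        upsampled.extend([phoneme] * rng.randint(min_repeat, max_repeat))
    return upsampled


def collapse_repeats(tokens: List[str]) -> List[str]:
    """Merge runs of identical consecutive tokens into a single token."""
    collapsed: List[str] = []
    for token in tokens:
        if not collapsed or collapsed[-1] != token:
            collapsed.append(token)
    return collapsed


# Illustrative phoneme sequence for the word "hello" (ARPAbet-style symbols).
phonemes = ["HH", "AH", "L", "OW"]
upsampled = upsample_phonemes(phonemes, seed=0)
print(len(phonemes), "->", len(upsampled), upsampled)
```

Because the example sequence has no adjacent identical phonemes, collapsing the repeated runs recovers the original sequence, which is a quick sanity check that the upsampling only stretches the sequence and does not reorder or drop phonemes.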
