Zhaocheng Huang

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Nov 01, 2023

Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico

Abstract: Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.

* Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation
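
As a rough illustration of the serialized labeling idea mentioned in the abstract, the sketch below flattens time-ordered segments from two speakers into one target string and inserts a special token at each speaker change. The token name `[TURN]`, the `Segment` fields, and the `serialize` function are hypothetical choices made for this example and are not taken from the released scripts; cross-talk handling is left out.

```python
from dataclasses import dataclass

# Hypothetical token name, chosen only for this illustration.
TURN_TOKEN = "[TURN]"


@dataclass
class Segment:
    speaker: str   # speaker label, e.g. "A" or "B"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcript or translation of the segment


def serialize(segments, turn_token=TURN_TOKEN):
    """Join segments in time order, marking each change of speaker."""
    ordered = sorted(segments, key=lambda s: s.start)
    pieces = []
    prev_speaker = None
    for seg in ordered:
        # Insert the token only when the speaker differs from the previous one.
        if prev_speaker is not None and seg.speaker != prev_speaker:
            pieces.append(turn_token)
        pieces.append(seg.text)
        prev_speaker = seg.speaker
    return " ".join(pieces)


example = [
    Segment("B", 1.0, 2.0, "hello"),
    Segment("A", 0.0, 1.0, "hola"),
    Segment("B", 2.0, 3.0, "how are you"),
    Segment("A", 3.0, 4.0, "bien"),
]
serialized = serialize(example)
```

A single sequence of this kind lets one decoder emit both the words and the turn boundaries, which is the general motivation for a serialized format; the actual token inventory and ordering rules should be checked against the linked repository.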

Representation learning through cross-modal conditional teacher-student training for speech emotion recognition

Nov 30, 2021

Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

Abstract: Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence while the differences in predicting activation and dominance dimensions are less pronounced. However, we show that even the best-performing HuBERT representation underperforms on valence prediction compared to a multimodal model that also incorporates text representation. We address this shortcoming by injecting lexical information into the speech representation using the multimodal model as a teacher. To improve the efficacy of our approach, we propose a novel estimate of the quality of the emotion predictions, to condition teacher-student training. We report new audio-only state-of-the-art concordance correlation coefficient (CCC) values of 0.757, 0.627, 0.671 for activation, valence and dominance predictions, respectively, on the MSP-Podcast corpus, and also state-of-the-art values of 0.667, 0.582, 0.545 on the IEMOCAP corpus.
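
For readers unfamiliar with the metric, the concordance correlation coefficient between predictions x and labels y is commonly defined as CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below computes it with population statistics; the function name `concordance_cc` is an illustrative assumption, and this is not the authors' evaluation code.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient using population (1/N) statistics."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = fmean(pred)
    mean_g = fmean(gold)
    # Population variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both inputs are constant and equal; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denom


perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

Unlike Pearson correlation, CCC penalizes a systematic offset or scale mismatch: in the `shifted` case the predictions are perfectly correlated with the labels, yet the squared mean difference in the denominator pulls the score below 1.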
