Antje Schweitzer

The IMS Toucan System for the Blizzard Challenge 2023

Oct 26, 2023

Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN-based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final waveform. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open-source code and a demo are available.

* Published at the Blizzard Challenge Workshop 2023, co-located with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023

Modeling Speaker-Listener Interaction for Backchannel Prediction

Apr 10, 2023

Daniel Ortega, Sarina Meyer, Antje Schweitzer, Ngoc Thang Vu

Abstract: We present our latest findings on backchannel modeling, motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their corresponding tokens in German, and by the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, their effects on the speaker's subsequent talk, and the consequent dynamic speaker-listener interaction. Therefore, we propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech, captures and imitates listeners' backchanneling behavior, and encodes the speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1-score.

* Published in IWSDS 2023

"splink" is happy and "phrouth" is scary: Emotion Intensity Analysis for Nonsense Words

Apr 05, 2022

Valentino Sabbatino, Enrica Troiano, Antje Schweitzer, Roman Klinger

Abstract: People associate affective meanings with words: "death" is scary and sad, while "party" is associated with surprise and joy. This raises the question of whether the association is purely a product of the learned affective imports inherent to semantic meanings, or is also an effect of other features of words, e.g., morphological and phonological patterns. We approach this question with an annotation-based analysis leveraging nonsense words. Specifically, we conduct a best-worst scaling crowdsourcing study in which participants assign intensity scores for joy, sadness, anger, disgust, fear, and surprise to 272 nonsense words and, for comparison of the results to previous work, to 68 real words. Based on this resource, we develop character-level and phonology-based intensity regressors. We evaluate them on both nonsense words and real words (making use of the NRC emotion intensity lexicon of 7493 words), across six emotion categories. The analysis of our data reveals that some phonetic patterns show clear differences between emotion intensities. For instance, s as a first phoneme contributes to joy, sh to surprise, and p as a last phoneme more to disgust than to anger and fear. In the modelling experiments, a regressor trained on real words from the NRC emotion intensity lexicon shows a higher performance (r = 0.17) than regressors that aim at learning the emotion connotation purely from nonsense words. We conclude that humans do associate affective meaning with words based on surface patterns, but also based on similarities to existing words ("juy" to "joy", or "flike" to "like").

* Accepted for WASSA 2022 at ACL 2022
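
As an illustration of the best-worst scaling setup mentioned in the abstract above, the following minimal sketch computes per-item scores with the commonly used counting procedure: the number of times an item was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. The abstract does not state how the authors aggregated their annotations, so this is an assumption rather than their method. The function name `bws_scores`, the item names `w1` to `w4`, and the judgments are all invented for the example.

```python
from collections import Counter


def bws_scores(annotations):
    """Score items from best-worst judgments by simple counting.

    annotations: iterable of (items, best, worst), where `items` is the
    tuple shown to an annotator and `best` / `worst` are the items they
    picked as highest and lowest on the rated dimension.
    Returns a dict mapping each item to a score in [-1, 1].
    """
    best_counts = Counter()
    worst_counts = Counter()
    shown_counts = Counter()
    for items, best, worst in annotations:
        if best not in items or worst not in items:
            raise ValueError("best and worst must be among the shown items")
        shown_counts.update(items)
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {
        item: (best_counts[item] - worst_counts[item]) / shown_counts[item]
        for item in shown_counts
    }


# Invented toy judgments for one emotion (not data from the paper).
toy_annotations = [
    (("w1", "w2", "w3", "w4"), "w1", "w2"),
    (("w1", "w2", "w3", "w4"), "w3", "w2"),
    (("w1", "w2", "w3", "w4"), "w1", "w4"),
]
scores = bws_scores(toy_annotations)
for item, score in sorted(scores.items()):
    print(f"{item}: {score:+.2f}")
```

Scores near +1 mark items that annotators consistently picked as most intense, and scores near -1 mark items consistently picked as least intense. In a study like the one described, a separate set of scores would be computed for each of the six emotions.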

Effects of Word Embeddings on Neural Network-based Pitch Accent Detection

Jun 07, 2018

Sabrina Stehwien, Ngoc Thang Vu, Antje Schweitzer

Abstract: Pitch accent detection often makes use of both acoustic and lexical features, based on the fact that pitch accents tend to correlate with certain words. In this paper, we extend a pitch accent detector that involves a convolutional neural network to include word embeddings, which are state-of-the-art vector representations of words. We examine the effect these features have on within-corpus and cross-corpus experiments on three English datasets. The results show that while word embeddings can improve the performance in corpus-dependent experiments, they also have the potential to make generalization to unseen data more challenging.

* This is an updated version of the paper that has been accepted at Speech Prosody 2018 and published on the ISCA archive. The updates consist of minor corrections that do not change the main conclusions in this work
