Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Julie Carson-Berndsen

University of Bielefeld, Germany

Real-Time Textless Dialogue Generation

Jan 08, 2025

Long Mai, Julie Carson-Berndsen

Abstract:Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involve separate, sequential components, as well as the use of text as an intermediate representation. This paper propose a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, filters, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: https://github.com/mailong25/rts2s-dg

Via

Access Paper or Ask Questions

Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

Dec 04, 2024

Long Mai, Julie Carson-Berndsen

Figure 1 for Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

Figure 2 for Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

Figure 3 for Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

Figure 4 for Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

Abstract:While Large Language Models (LLMs) have made significant strides in replicating human-like abilities, there are concerns about a reduction in the linguistic diversity of their outputs. This results in the homogenization of viewpoints and perspectives, as well as the underrepresentation of specific demographic groups. Although several fine-tuning and prompting techniques have been suggested to tackle the issue, they are often tailored to specific tasks or come with a substantial increase in computational cost and latency. This makes them challenging to apply to applications that demand very low latency, such as chatbots and virtual assistants. We propose Possibility Exploration Fine-Tuning (PEFT), a task-agnostic framework that enhances the text diversity of LLMs without increasing latency or computational cost. Given the same prompt, models fine-tuned with PEFT can simultaneously generate multiple diverse responses, each corresponding with a controllable possibility number. Experiments on dialogue and story generation tasks demonstrate that PEFT significantly enhances the diversity of LLM outputs, as evidenced by lower similarity between candidate responses. Since PEFT emphasizes semantic diversity over lexical diversity, it can also notably reduce demographic bias in dialogue systems. The implementations and datasets are available in our repository: https://github.com/mailong25/peft_diversity

Via

Access Paper or Ask Questions

Enhancing conversational quality in language learning chatbots: An evaluation of GPT4 for ASR error correction

Jul 19, 2023

Long Mai, Julie Carson-Berndsen

Abstract:The integration of natural language processing (NLP) technologies into educational applications has shown promising results, particularly in the language learning domain. Recently, many spoken open-domain chatbots have been used as speaking partners, helping language learners improve their language skills. However, one of the significant challenges is the high word-error-rate (WER) when recognizing non-native/non-fluent speech, which interrupts conversation flow and leads to disappointment for learners. This paper explores the use of GPT4 for ASR error correction in conversational settings. In addition to WER, we propose to use semantic textual similarity (STS) and next response sensibility (NRS) metrics to evaluate the impact of error correction models on the quality of the conversation. We find that transcriptions corrected by GPT4 lead to higher conversation quality, despite an increase in WER. GPT4 also outperforms standard error correction methods without the need for in-domain training data.

Via

Access Paper or Ask Questions

Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

May 12, 2023

Emma O'Neill, Julie Carson-Berndsen

Figure 1 for Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Figure 2 for Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Figure 3 for Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Figure 4 for Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Abstract:Automatic Speech Recognition (ASR) systems exhibit the best performance on speech that is similar to that on which it was trained. As such, underrepresented varieties including regional dialects, minority-speakers, and low-resource languages, see much higher word error rates (WERs) than those varieties seen as 'prestigious', 'mainstream', or 'standard'. This can act as a barrier to incorporating ASR technology into the annotation process for large-scale linguistic research since the manual correction of the erroneous automated transcripts can be just as time and resource consuming as manual transcriptions. A deeper understanding of the behaviour of an ASR system is thus beneficial from a speech technology standpoint, in terms of improving ASR accuracy, and from an annotation standpoint, where knowing the likely errors made by an ASR system can aid in this manual correction. This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes. Specifically, how particular phonetic realisations which were rare or absent in the system's training data can lead to phoneme level misrecognitions and contribute to higher WERs. It is demonstrated that the behaviour of the ASR is systematic and consistent across speakers with similar spoken varieties (in this case the same L1) and phoneme substitution errors are typically in agreement with human annotators. By identifying problematic productions specific weaknesses can be addressed by sourcing such realisations for training and fine-tuning thus making the system more robust to pronunciation variation.

Via

Access Paper or Ask Questions

Unsupervised domain adaptation for speech recognition with unsupervised error correction

Sep 24, 2022

Long Mai, Julie Carson-Berndsen

Figure 1 for Unsupervised domain adaptation for speech recognition with unsupervised error correction

Figure 2 for Unsupervised domain adaptation for speech recognition with unsupervised error correction

Figure 3 for Unsupervised domain adaptation for speech recognition with unsupervised error correction

Figure 4 for Unsupervised domain adaptation for speech recognition with unsupervised error correction

Abstract:The transcription quality of automatic speech recognition (ASR) systems degrades significantly when transcribing audios coming from unseen domains. We propose an unsupervised error correction method for unsupervised ASR domain adaption, aiming to recover transcription errors caused by domain mismatch. Unlike existing correction methods that rely on transcribed audios for training, our approach requires only unlabeled data of the target domains in which a pseudo-labeling technique is applied to generate correction training samples. To reduce over-fitting to the pseudo data, we also propose an encoder-decoder correction model that can take into account additional information such as dialogue context and acoustic features. Experiment results show that our method obtains a significant word error rate (WER) reduction over non-adapted ASR systems. The correction model can also be applied on top of other adaptation approaches to bring an additional improvement of 10% relatively.

* Interspeech 2022

Via

Access Paper or Ask Questions

Phoneme Recognition Using Acoustic Events

Jul 15, 1994

Kai Huebener, Julie Carson-Berndsen

Figure 1 for Phoneme Recognition Using Acoustic Events

Abstract:This paper presents a new approach to phoneme recognition using nonsequential sub--phoneme units. These units are called acoustic events and are phonologically meaningful as well as recognizable from speech signals. Acoustic events form a phonologically incomplete representation as compared to distinctive features. This problem may partly be overcome by incorporating phonological constraints. Currently, 24 binary events describing manner and place of articulation, vowel quality and voicing are used to recognize all German phonemes. Phoneme recognition in this paradigm consists of two steps: After the acoustic events have been determined from the speech signal, a phonological parser is used to generate syllable and phoneme hypotheses from the event lattice. Results obtained on a speaker--dependent corpus are presented.

* 4 pages, to appear at ICSLP'94, PostScript version (compressed and uuencoded)

Via

Access Paper or Ask Questions