Raul Fernandez

Exploring the Benefits of Tokenization of Discrete Acoustic Units

Jun 08, 2024

Avihu Dekel, Raul Fernandez

Abstract: Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.

* Interspeech 2024
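
The merging idea summarized in the abstract above can be illustrated with a minimal sketch. It assumes a byte-pair-encoding (BPE) style procedure applied to sequences of integer unit IDs; the abstract does not name the specific tokenizer, so the choice of BPE, the function names `learn_merges` and `apply_merge`, and the toy unit IDs are all hypothetical and for illustration only.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged = pair[0] + pair[1]  # tokens are tuples of base unit IDs, so + concatenates
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Greedily learn up to `num_merges` BPE-style merges over unit-ID sequences.

    Assumption: this is plain BPE, used here only as one example of an
    algorithm that merges base units into larger, variable-rate units.
    """
    # Represent each base unit as a 1-tuple so merged tokens stay traceable.
    seqs = [[(u,) for u in seq] for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges would not shorten anything
            break
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges, seqs
```

On toy input such as `learn_merges([[7, 7, 3, 7, 7, 3], [7, 7, 5]], num_merges=2)`, the frequent pair `7 7` is merged first and `(7 7) 3` second, so the first sequence shrinks from six units to two tokens. Shorter sequences of this kind are one plausible source of the training and inference speedups the abstract reports, although the sketch itself makes no claim about the paper's actual setup.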

Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations

Mar 17, 2024

Claudio Pinhanez, Raul Fernandez, Marcelo Grave, Julio Nogima, Ron Hoory

Abstract: Representations of AI agents in user interfaces and robotics are predominantly White, not only in terms of facial and skin features, but also in the synthetic voices they use. In this paper we explore some unexpected challenges in the representation of race we found in the process of developing a U.S. English Text-to-Speech (TTS) system aimed to sound like an educated, professional, regional accent-free African American woman. The paper starts by presenting the results of focus groups with African American IT professionals where guidelines and challenges for the creation of a representative and appropriate TTS system were discussed and gathered, followed by a discussion about some of the technical difficulties faced by the TTS system developers. We then describe two studies with U.S. English speakers where the participants were not able to attribute the correct race to the African American TTS voice while overwhelmingly correctly recognizing the race of a White TTS system of similar quality. A focus group with African American IT workers not only confirmed the representativeness of the African American voice we built, but also suggested that the surprising recognition results may have been caused by the inability or the latent prejudice of non-African Americans to associate educated, non-vernacular, professional-sounding voices with African American people.

* Full version including appendixes

A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers

Sep 20, 2023

Slava Shechtman, Raul Fernandez

Abstract: Modern neural TTS systems are capable of generating natural and expressive speech when provided with sufficient amounts of training data. Such systems can be equipped with prosody-control functionality, allowing for more direct shaping of the speech output at inference time. In some TTS applications, it may be desirable to have an option that guides the TTS system with an ad-hoc speech recording exemplar to impose an implicit fine-grained, user-preferred prosodic realization for certain input prompts. In this work we present a first-of-its-kind neural TTS system equipped with such functionality to transfer the prosody from a parallel text recording from an unseen speaker. We demonstrate that the proposed system can precisely transfer the speech prosody from novel speakers to various trained TTS voices with no quality degradation, while preserving the target TTS speakers' identity, as evaluated by a set of subjective listening experiments.

* Proc. INTERSPEECH 2023, 4853-4857 (2023)
* Presented at Interspeech 2023

Speak While You Think: Streaming Speech Synthesis During Text Generation

Sep 20, 2023

Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Zvi Kons, Ron Hoory

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture to synthesize speech while text is being generated by an LLM, which yields significant latency reduction. LLM2Speech mimics the predictions of a non-streaming teacher model while limiting the exposure to future context in order to enable streaming. It exploits the hidden embeddings of the LLM, a by-product of the text generation that contains informative semantic context. Experimental results show that LLM2Speech maintains the teacher's quality while reducing the latency to enable natural conversations.

* Under review for ICASSP 2024

Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis

Jul 25, 2022

Raul Fernandez, David Haws, Guy Lorberbom, Slava Shechtman, Alexander Sorin

Abstract: Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate amounts of training data. Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer Interaction systems. In this work we explore one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections. We elaborate on the corpus design and explore the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation. In a set of subjective listening experiments, this approach resulted in high-fidelity style transfer with no quality degradation. However, a certain voice persona shift was observed, requiring further improvements in voice conversion.

* Accepted for presentation at Interspeech 2022

Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis

Jan 25, 2021

Slava Shechtman, Raul Fernandez, David Haws

Abstract: Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests that demonstrate we are able to successfully realize controllable focus while maintaining the same, or higher, naturalness over an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.

* IEEE Spoken Language Technology Workshop (SLT), 2021
