Beata Lorincz

FlexLip: A Controllable Text-to-Lip System

Jun 07, 2022

Dan Oneata, Beata Lorincz, Adriana Stan, Horia Cucu

Abstract: The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed, some of them reaching close-to-natural performance in constrained tasks. In this paper, we tackle a sub-issue of the text-to-video generation problem by converting the text into lip landmarks. We do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each component, while also ensuring fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system by taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.

* Sensors. 2022; 22(11):4104
* 16 pages, 4 tables, 4 figures

An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis

Jun 03, 2021

Beata Lorincz, Adriana Stan, Mircea Giurgiu

Abstract: Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is possible due to the use of a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between natural and synthesised t-SNE projections of the embeddings computed by an accurate speaker verification network. The results show that there is indeed a strong correlation between the recording conditions and the speaker's synthetic voice quality. Speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representation of some speakers, but not all of them. As a result, the output speech for the affected speakers is of lower quality.
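
The word error rate (WER) named in the abstract is conventionally defined as the word-level edit distance between a reference transcript and a recognised transcript, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example, and the abstract does not say which tool or normalisation the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenisation is a plain
    whitespace split, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For TTS evaluation, the hypothesis would typically be the output of a speech recogniser run on the synthesised audio, with the input text as the reference. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.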

* Accepted at 25th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2021)

Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

Jun 03, 2021

Beata Lorincz, Adriana Stan, Mircea Giurgiu

Abstract: Building multispeaker neural network-based text-to-speech synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, the multispeaker TTS can be hard to train and will result in poor speaker similarity and naturalness. To address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by appending an additional loss term; and augmenting the input data pertaining to each speaker using waveform manipulation methods. We show that both methods are effective when evaluated with both objective and subjective measures. The additional loss term aids the speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.
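
The idea of appending a speaker-related term to the training loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation used in the paper: the abstract does not specify the loss, so the cosine distance between speaker embeddings, the `weight` parameter, and the function names `cosine_distance` and `total_loss` are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def total_loss(
    reconstruction_loss: float,
    natural_embedding: Sequence[float],
    synthesised_embedding: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Add a weighted speaker-similarity penalty to the base TTS loss.

    Assumption of this sketch: the embeddings come from some speaker
    verification network applied to natural and synthesised speech.
    """
    penalty = cosine_distance(natural_embedding, synthesised_embedding)
    return reconstruction_loss + weight * penalty
```

In an actual training setup the same computation would be written with differentiable tensor operations so that the penalty contributes gradients to the acoustic model; plain Python floats are used here only to keep the example self-contained.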

* Accepted at EUSIPCO 2021
