Tom Kenter

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Aug 28, 2022

Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang(+2 more)

Abstract: Transfer tasks in text-to-speech (TTS) synthesis, where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects, remain challenging. One of the challenges is that models with high-quality transfer capabilities can have stability issues, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that models trained on synthetic data in this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.

* To be published in Interspeech 2022

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Sep 09, 2019

Rob Clark, Hanna Silen, Tom Kenter, Ralph Leith

Abstract: Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate, as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech. We find that, even though these three evaluations are based upon the same material, the outcomes differ per setting, and moreover that these outcomes do not necessarily correlate with each other. We show that our findings are consistent between a single-speaker setting of read paragraphs and a two-speaker dialogue scenario. We conclude that to evaluate the quality of long-form speech, the traditional way of evaluating sentences in isolation does not suffice, and that multiple evaluations are required.

* Accepted for The 10th ISCA Speech Synthesis Workshop (SSW10), 6 pages

CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Jun 04, 2019

Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, Rob Clark

Abstract: The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the autoencoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.
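
The idea of sampling a prosody embedding at inference time can be illustrated with a minimal sketch. It assumes the variational layer is a diagonal Gaussian described by a mean and a log-variance, which is a common choice for variational autoencoders but is not stated in the abstract. The function name `sample_prosody_embedding`, the embedding size of 8, and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_prosody_embedding(mean, log_var, rng, temperature=1.0):
    """Draw z = mean + temperature * std * eps, with eps ~ N(0, I).

    Assumption: a diagonal Gaussian latent; `temperature` is a hypothetical
    knob that scales how far samples stray from the mean.
    """
    std = np.exp(0.5 * log_var)            # log-variance -> standard deviation
    eps = rng.standard_normal(mean.shape)  # fresh noise for every call
    return mean + temperature * std * eps

rng = np.random.default_rng(0)
mean = np.zeros(8)     # toy latent mean (hypothetical size)
log_var = np.zeros(8)  # unit variance

z_a = sample_prosody_embedding(mean, log_var, rng)
z_b = sample_prosody_embedding(mean, log_var, rng)
z_flat = sample_prosody_embedding(mean, log_var, rng, temperature=0.0)
```

In this sketch, `z_a` and `z_b` are two different embeddings for the same input, which is what would let a decoder render one text with different prosody. With `temperature=0.0` the sample collapses to the mean, corresponding to the averaged prosody the abstract aims to avoid.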

HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

Oct 12, 2018

Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps, Maarten de Rijke

Abstract: A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three distributions for assessing the diversity of documents: distributions of words within documents, words within topics, and topics within documents. Topic models play a central role in this approach and, hence, their quality is crucial to the efficacy of measuring topical diversity. The quality of topic models is affected by two problems: generality and impurity of topics. General topics only include common information of a background corpus and are assigned to most of the documents. Impure topics contain words that are not related to the topic. Impurity lowers the interpretability of topic models, and impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation process aimed at removing generality and impurity. Our approach has three re-estimation components: (1) document re-estimation, which removes general words from the documents; (2) topic re-estimation, which re-estimates the distribution over words of each topic; and (3) topic assignment re-estimation, which re-estimates for each document its distribution over topics. For measuring topical diversity of text documents, our HiTR approach improves over the state of the art, as measured on the PubMed dataset.

* IEEE Transactions on Knowledge and Data Engineering

Attentive Memory Networks: Efficient Machine Reading for Conversational Search

Dec 19, 2017

Tom Kenter, Maarten de Rijke

Abstract: Recent advances in conversational systems have changed the search paradigm. Traditionally, a user poses a query to a search engine that returns an answer based on its index, possibly leveraging external knowledge bases and conditioning the response on earlier interactions in the search session. In a natural conversation, there is an additional source of information to take into account: utterances produced earlier in a conversation can also be referred to, and a conversational IR system has to keep track of information conveyed by the user during the conversation, even if it is implicit. We argue that the process of building a representation of the conversation can be framed as a machine reading task, where an automated system is presented with a number of statements about which it should answer questions. The questions should be answered solely by referring to the statements provided, without consulting external knowledge. The time is right for the information retrieval community to embrace this task, both as a stand-alone task and integrated in a broader conversational search setting. In this paper, we focus on machine reading as a stand-alone task and present the Attentive Memory Network (AMN), an end-to-end trainable machine reading algorithm. Its key contribution is in efficiency, achieved by having a hierarchical input encoder that iterates over the input only once. Speed is an important requirement in the setting of conversational search, as gaps between conversational turns have a detrimental effect on naturalness. On 20 datasets commonly used for evaluating machine reading algorithms, we show that the AMN achieves performance comparable to the state-of-the-art models, while using considerably fewer computations.

* Proceedings of 1st International Workshop on Conversational Approaches to Information Retrieval, Tokyo, Japan, August 11, 2017 (CAIR'17)

Neural Networks for Information Retrieval

Jul 13, 2017

Tom Kenter, Alexey Borisov, Christophe Van Gysel, Mostafa Dehghani, Maarten de Rijke, Bhaskar Mitra

Abstract: Machine learning plays a role in many aspects of modern IR systems, and deep learning is applied in all of them. The fast pace of modern-day research has given rise to many different approaches for many different IR problems. The amount of information available can be overwhelming both for junior students and for experienced researchers looking for new research topics and directions. Additionally, it is interesting to see what key insights into IR problems the new technologies are able to give us. The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR research. It covers key architectures, as well as the most promising future directions.

* Overview of full-day tutorial at SIGIR 2017

Siamese CBOW: Optimizing Word Embeddings for Sentence Representations

Jun 15, 2016

Tom Kenter, Alexey Borisov, Maarten de Rijke

Abstract: We present the Siamese Continuous Bag of Words (Siamese CBOW) model, a neural network for efficient estimation of high-quality sentence embeddings. Averaging the embeddings of words in a sentence has proven to be a surprisingly successful and efficient way of obtaining sentence embeddings. However, word embeddings trained with the methods currently available are not optimized for the task of sentence representation and are thus likely to be suboptimal. Siamese CBOW handles this problem by training word embeddings directly for the purpose of being averaged. The underlying neural network learns word embeddings by predicting, from a sentence representation, its surrounding sentences. We show the robustness of the Siamese CBOW model by evaluating it on 20 datasets stemming from a wide variety of sources.

* Accepted as full paper at ACL 2016, Berlin. 11 pages
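
The Siamese CBOW abstract above describes averaging word embeddings into a sentence representation and using it to predict surrounding sentences. A minimal sketch of that idea follows. It assumes candidate sentences are scored by a softmax over cosine similarities, which is not spelled out in the abstract. All names (`sentence_embedding`, `neighbour_probabilities`), the toy vocabulary, and the random embeddings are hypothetical, and no training loop is shown.

```python
import numpy as np

def sentence_embedding(word_ids, embeddings):
    """Average the word vectors of a sentence given as a list of word ids."""
    return embeddings[word_ids].mean(axis=0)

def cosine(u, v):
    """Cosine similarity with a small constant to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def neighbour_probabilities(sentence, candidates, embeddings):
    """Softmax over cosine similarities between a sentence and candidates.

    Assumption: the candidates mix true neighbouring sentences with
    randomly drawn ones; the scoring function is chosen for illustration.
    """
    s = sentence_embedding(sentence, embeddings)
    sims = np.array(
        [cosine(s, sentence_embedding(c, embeddings)) for c in candidates]
    )
    exp = np.exp(sims - sims.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))  # toy vocabulary: 10 words, 4 dimensions

sentence = [0, 1, 2]
candidates = [[1, 2, 3], [4, 5], [7, 8, 9]]
probs = neighbour_probabilities(sentence, candidates, embeddings)

# Treat candidate 0 as the true neighbour; training would minimise this loss
# with respect to the word embeddings themselves.
loss = -np.log(probs[0])
```

Because the loss depends on the word embeddings only through their average, gradient updates would tune the word vectors specifically for being averaged, which is the property the abstract emphasises.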
