Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rob Clark

Guiding Empowerment Model: Liberating Neurodiversity in Online Higher Education

Oct 24, 2024

Hannah Beaux, Pegah Karimi, Otilia Pop, Rob Clark

Figure 1 for Guiding Empowerment Model: Liberating Neurodiversity in Online Higher Education

Figure 2 for Guiding Empowerment Model: Liberating Neurodiversity in Online Higher Education

Abstract:In this innovative practice full paper, we address the equity gap for neurodivergent and situationally limited learners by identifying the spectrum of dynamic factors that impact learning and function. Educators have shown a growing interest in identifying learners' cognitive abilities and learning preferences to measure their impact on academic achievement. Often institutions employ one-size-fits-all approaches leaving the burden on disabled students to self-advocate or tolerate inadequate support. Emerging frameworks guide neurodivergent learners through instructional approaches, such as online education. However, these frameworks fail to address holistic environmental needs or recommend technology interventions, particularly for those with undisclosed learning or developmental disabilities and situational limitations. In this article, we integrate a neurodivergent perspective through secondary research of around 100 articles to introduce a Guiding Empowerment Model involving key cognitive and situational factors that contextualize day-to-day experiences affecting learner ability. We synthesize three sample student profiles that highlight user problems in functioning. We use this model to evaluate sample learning platform features and other supportive technology solutions. The proposed approach augments frameworks such as Universal Design for Learning to consider factors including various sensory processing differences, social connection challenges, and environmental limitations. We suggest that by applying the mode through technology-enabled features such as customizable task management, guided varied content access, and guided multi-modal collaboration, major learning barriers of neurodivergent and situationally limited learners will be removed to activate the successful pursuit of their academic goals.

* 9 pages, 1 Figure, 1 Table, Accepted in FIE 2024

Via

Access Paper or Ask Questions

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Aug 28, 2022

Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang(+2 more)

Figure 1 for Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Figure 2 for Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Figure 3 for Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Figure 4 for Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Abstract:Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers is transferred to another set of speakers that do not feature these aspects originally - remains a challenging task. One of the challenges is that models that have high-quality transfer capabilities can have issues in stability, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that the models trained on synthetic data this way can produce high quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.

* To be published in Interspeech 2022

Via

Access Paper or Ask Questions

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Sep 09, 2019

Rob Clark, Hanna Silen, Tom Kenter, Ralph Leith

Figure 1 for Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Figure 2 for Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Figure 3 for Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Figure 4 for Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Abstract:Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues is considered, evaluating sentences in isolation is not always appropriate as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech. We find that, even though these three evaluations are based upon the same material, the outcomes differ per setting, and moreover that these outcomes do not necessarily correlate with each other. We show that our findings are consistent between a single speaker setting of read paragraphs and a two-speaker dialogue scenario. We conclude that to evaluate the quality of long-form speech, the traditional way of evaluating sentences in isolation does not suffice, and that multiple evaluations are required.

* Accepted for The 10th ISCA Speech Synthesis Workshop (SSW10), 6 pages

Via

Access Paper or Ask Questions

CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Jun 04, 2019

Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, Rob Clark

Figure 1 for CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Figure 2 for CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Figure 3 for CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Figure 4 for CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Abstract:The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the auto-encoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.

Via

Access Paper or Ask Questions

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Mar 24, 2018

RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

Figure 1 for Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Figure 2 for Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Figure 3 for Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Figure 4 for Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Abstract:We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

Via

Access Paper or Ask Questions

Uncovering Latent Style Factors for Expressive Speech Synthesis

Nov 01, 2017

Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

Figure 1 for Uncovering Latent Style Factors for Expressive Speech Synthesis

Figure 2 for Uncovering Latent Style Factors for Expressive Speech Synthesis

Figure 3 for Uncovering Latent Style Factors for Expressive Speech Synthesis

Abstract:Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.

* Submitted to NIPS ML4Audio workshop and ICASSP

Via

Access Paper or Ask Questions

Tacotron: Towards End-to-End Speech Synthesis

Apr 06, 2017

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio(+4 more)

Figure 1 for Tacotron: Towards End-to-End Speech Synthesis

Figure 2 for Tacotron: Towards End-to-End Speech Synthesis

Figure 3 for Tacotron: Towards End-to-End Speech Synthesis

Figure 4 for Tacotron: Towards End-to-End Speech Synthesis

Abstract:A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

* Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes)

Via

Access Paper or Ask Questions