Hadar Shemtov

SMOL: Professionally translated parallel data for 115 under-represented languages

Feb 17, 2025

Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya(+2 more)

Abstract: We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.

* ~10 pages with appendices

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Feb 29, 2024

Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.

* To appear in ICASSP 2024

Exploring Conversational Language Generation for Rich Content about Hotels

May 01, 2018

Marilyn A. Walker, Albry Smither, Shereen Oraby, Vrindavan Harrison, Hadar Shemtov

Abstract: Dialogue systems for hotel and tourist information have typically simplified the richness of the domain, focusing system utterances on only a few selected attributes such as price, location and type of rooms. However, much more content is typically available for hotels, often as many as 50 distinct instantiated attributes for an individual entity. New methods are needed to use this content to generate natural dialogues for hotel information, and in general for any domain with such rich complex content. We describe three experiments aimed at collecting data that can inform an NLG for hotels dialogues, and show, not surprisingly, that the sentences in the original written hotel descriptions provided on webpages for each hotel are stylistically not a very good match for conversational interaction. We quantify the stylistic features that characterize the differences between the original textual data and the collected dialogic data. We plan to use these in stylistic models for generation, and for scoring retrieved utterances for use in hotel dialogues.

* This version contains updates to the version published at LREC '18
