Richard Sproat

Bell Laboratories

BiPhone: Modeling Inter Language Phonetic Influences in Text

Jul 06, 2023

Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer

Abstract:A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.

* Accepted at ACL 2023

Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency

Jun 07, 2023

Shigeki Karita, Richard Sproat, Haruko Ishikawa

Abstract:Word error rate (WER) and character error rate (CER) are standard metrics in Speech Recognition (ASR), but one problem has always been alternative spellings: If one's system transcribes adviser whereas the ground truth has advisor, this will count as an error even though the two spellings really represent the same word. Japanese is notorious for "lacking orthography": most words can be spelled in multiple ways, presenting a problem for accurate ASR evaluation. In this paper we propose a new lenient evaluation metric as a more defensible CER measure for Japanese ASR. We create a lattice of plausible respellings of the reference transcription, using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model for reconstructing kanji from hiragana or katakana. In a manual evaluation, raters rated 95.4% of the proposed spelling variants as plausible. ASR results show that our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task.

* ACL Workshop on Computation and Written Language (CAWL) 2023
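
The idea of not penalizing a valid alternate spelling can be illustrated with a minimal Python sketch. It assumes the respellings are available as a flat list of strings rather than the lattice the abstract describes, and it normalizes by the length of each variant; both are simplifications made here, not details from the paper. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Standard character error rate against one non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def lenient_cer(ref_variants: list[str], hyp: str) -> float:
    """Assumed simplification: score against the closest plausible respelling."""
    return min(cer(variant, hyp) for variant in ref_variants)


# The adviser/advisor example from the abstract.
strict = cer("advisor", "adviser")
lenient = lenient_cer(["advisor", "adviser"], "adviser")
```

Under strict scoring the hypothesis "adviser" is charged one substitution; once "adviser" is admitted as a valid respelling of the reference, the lenient score drops to zero.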

Beyond Arabic: Software for Perso-Arabic Script Manipulation

Jan 26, 2023

Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat

Abstract:This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.

* Preprint to appear in the Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP 2022) at EMNLP, Abu Dhabi, United Arab Emirates, December 7-11, 2022. 7 pages

Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation

Oct 18, 2022

Llion Jones, Richard Sproat, Haruko Ishikawa, Alexander Gutkin

Abstract:If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced "how-ston" and not like the Texas city, then one can probably guess that "how-ston" is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model to finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: Cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced (https://github.com/google-research/google-research/tree/master/cognate_inpaint_neighbors).

* 16 pages, to appear in Transactions of the Association for Computational Linguistics

Structured abbreviation expansion in context

Oct 04, 2021

Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat

Abstract:Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, in that ad hoc abbreviations are intentional and may involve substantial differences from the original words. Ad hoc abbreviations are productively generated on-the-fly, so they cannot be resolved solely by dictionary lookup. We generate a large, open-source data set of ad hoc abbreviations. This data is used to study abbreviation strategies and to develop two strong baselines for abbreviation expansion.

* Accepted to Findings of EMNLP 2021

Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities

Nov 05, 2020

Hao Zhang, Jae Ro, Richard Sproat

Abstract:Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
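
Framing segmentation as character tagging, as the abstract does, can be sketched in Python as follows. The two-label scheme used here ("B" for a character that begins a word, "I" for one that continues it) is an assumption for illustration; the abstract does not state the paper's tag inventory. The function names are hypothetical, and no RNN is involved: the sketch only shows how a segmentation maps to per-character tags and back.

```python
def words_to_tags(words: list[str]) -> list[str]:
    """Encode a segmentation as one tag per character (assumed B/I scheme)."""
    tags = []
    for word in words:
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(text: str, tags: list[str]) -> list[str]:
    """Decode per-character tags back into the component words."""
    words = []
    for char, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


# The openresearch example from the abstract.
tags = words_to_tags(["open", "research"])
segments = tags_to_words("openresearch", tags)
```

A tagging model would predict the tag sequence from the unsegmented string; decoding with `tags_to_words` then yields the word list.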

Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Oct 14, 2020

Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson (+11 more)

Abstract:This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.

* Appeared in 2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4-6 December, Paris, France

NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task

Oct 12, 2020

Alexander Gutkin, Richard Sproat

Abstract:This paper describes the NEMO submission to SIGTYP 2020 shared task which deals with prediction of linguistic typological features for multiple languages using the data derived from World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved the micro-averaged accuracy score of 0.66 on 149 test languages.

* To appear in Second Workshop on Computational Research in Linguistic Typology (SIGTYP 2020) at 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
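
For readers unfamiliar with the metric quoted in the abstract, micro-averaged accuracy pools every individual prediction before dividing, so languages with more features to predict weigh more. The Python sketch below shows this under assumptions: the nested-dictionary layout, the function name, and the toy language and feature names are all hypothetical, and this is not the shared task's scoring script.

```python
def micro_accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of correct (language, feature) predictions, pooled over all languages."""
    correct = 0
    total = 0
    for language, features in gold.items():
        for feature, value in features.items():
            total += 1
            # A missing prediction counts as wrong.
            if predictions.get(language, {}).get(feature) == value:
                correct += 1
    return correct / total


# Toy data with made-up names, for illustration only.
gold = {
    "lang_a": {"feat_1": "x", "feat_2": "y", "feat_3": "z"},
    "lang_b": {"feat_1": "x"},
}
predicted = {
    "lang_a": {"feat_1": "x", "feat_2": "n", "feat_3": "z"},
    "lang_b": {"feat_1": "q"},
}
score = micro_accuracy(predicted, gold)
```

Here two of the four pooled predictions are correct. A per-language (macro) average of the same data would instead average 2/3 and 0.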

Automatic Ambiguity Detection

May 28, 2019

Richard Sproat, Jan van Santen

Abstract:Most work on sense disambiguation presumes that one knows beforehand -- e.g. from a thesaurus -- a set of polysemous terms. But published lists invariably give only partial coverage. For example, the English word tan has several obvious senses, but one may overlook the abbreviation for tangent. In this paper, we present an algorithm for identifying interesting polysemous terms and measuring their degree of polysemy, given an unlabeled corpus. The algorithm involves: (i) collecting all terms within a k-term window of the target term; (ii) computing the inter-term distances of the contextual terms, and reducing the multi-dimensional distance space to two dimensions using standard methods; (iii) converting the two-dimensional representation into radial coordinates and using isotonic/antitonic regression to compute the degree to which the distribution deviates from a single-peak model. The amount of deviation is the proposed polysemy index.

* International Conference on Spoken Language Processing, 1998
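
Step (i) of the algorithm in the abstract above, collecting the terms within a k-term window of the target, can be sketched in Python. The sketch assumes a symmetric window over a pre-tokenized corpus, which the abstract does not specify, and the function name is hypothetical. Steps (ii) and (iii) are not covered.

```python
def context_terms(tokens: list[str], target: str, k: int) -> list[str]:
    """Collect every token within k positions of each occurrence of the target."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - k):i]    # up to k tokens before the target
            right = tokens[i + 1:i + 1 + k]   # up to k tokens after the target
            contexts.extend(left + right)
    return contexts


# Toy sentence mixing two senses of "tan", as in the abstract's example.
tokens = "the tan of the angle and a deep tan from the sun".split()
neighbors = context_terms(tokens, "tan", k=2)
```

The resulting context terms would then feed the distance computation in step (ii).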

RNN Approaches to Text Normalization: A Challenge

Jan 24, 2017

Richard Sproat, Navdeep Jaitly

Abstract:This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future. We also present our own experiments with this data set with a variety of different RNN architectures. While some of the architectures do in fact produce very good results when measured in terms of overall accuracy, the errors that are produced are problematic, since they would convey completely the wrong message if such a system were deployed in a speech application. On the other hand, we show that a simple FST-based filter can mitigate those errors, and achieve a level of accuracy not achievable by the RNN alone. Though our conclusions are largely negative on this point, we are actually not arguing that the text normalization problem is intractable using a pure RNN approach, merely that it is not going to be something that can be solved merely by having huge amounts of annotated text data and feeding that to a general RNN model. And when we open-source our data, we will be providing a novel data set for sequence-to-sequence modeling in the hopes that the community can find better solutions. The data used in this work have been released and are available at: https://github.com/rwsproat/text-normalization-data

* 17 pages, 13 tables, 3 figures
