Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lawrence Wolf-Sonkin

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Jul 02, 2020

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall

Figure 1 for Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Figure 2 for Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Figure 3 for Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Figure 4 for Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Abstract:This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages

* Published at LREC 2020

Via

Access Paper or Ask Questions

On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

May 03, 2020

Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach

Figure 1 for On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

Figure 2 for On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

Figure 3 for On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

Figure 4 for On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

Abstract:We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. For all six languages, we find that there is a statistically significant relationship. We also find that there are statistically significant relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. We defer a deeper investigation of these relationships for future work.

* 17 pages, 6 figures, 4 tables, TACL(a) final submission

Via

Access Paper or Ask Questions

Quantifying the Semantic Core of Gender Systems

Oct 29, 2019

Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach

Figure 1 for Quantifying the Semantic Core of Gender Systems

Figure 2 for Quantifying the Semantic Core of Gender Systems

Figure 3 for Quantifying the Semantic Core of Gender Systems

Abstract:Many of the world's languages employ grammatical gender on the lexeme. For example, in Spanish, the word for 'house' (casa) is feminine, whereas the word for 'paper' (papel) is masculine. To a speaker of a genderless language, this assignment seems to exist with neither rhyme nor reason. But is the assignment of inanimate nouns to grammatical genders truly arbitrary? We present the first large-scale investigation of the arbitrariness of noun-gender assignments. To that end, we use canonical correlation analysis to correlate the grammatical gender of inanimate nouns with an externally grounded definition of their lexical semantics. We find that 18 languages exhibit a significant correlation between grammatical gender and lexical semantics.

* 6 pages, 2 figures, accepted to EMNLP 2019

Via

Access Paper or Ask Questions

The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Oct 25, 2019

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz(+2 more)

Figure 1 for The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Figure 2 for The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Figure 3 for The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Figure 4 for The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Abstract:The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year's strong baselines or highly ranked systems from previous years' shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.

* Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (2019) 229-244
* Presented at SIGMORPHON 2019

Via

Access Paper or Ask Questions

Combining Sentiment Lexica with a Multi-View Variational Autoencoder

Apr 05, 2019

Alexander Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Ryan Cotterell, Isabelle Augenstein

Figure 1 for Combining Sentiment Lexica with a Multi-View Variational Autoencoder

Figure 2 for Combining Sentiment Lexica with a Multi-View Variational Autoencoder

Figure 3 for Combining Sentiment Lexica with a Multi-View Variational Autoencoder

Figure 4 for Combining Sentiment Lexica with a Multi-View Variational Autoencoder

Abstract:When assigning quantitative labels to a dataset, different methodologies may rely on different scales. In particular, when assigning polarities to words in a sentiment lexicon, annotators may use binary, categorical, or continuous labels. Naturally, it is of interest to unify these labels from disparate scales to both achieve maximal coverage over words and to create a single, more robust sentiment lexicon while retaining scale coherence. We introduce a generative model of sentiment lexica to combine disparate scales into a common latent representation. We realize this model with a novel multi-view variational autoencoder (VAE), called SentiVAE. We evaluate our approach via a downstream text classification task involving nine English-Language sentiment analysis datasets; our representation outperforms six individual sentiment lexica, as well as a straightforward combination thereof.

* To appear in NAACL-HLT 2019

Via

Access Paper or Ask Questions

A Structured Variational Autoencoder for Contextual Morphological Inflection

Jun 10, 2018

Lawrence Wolf-Sonkin, Jason Naradowsky, Sebastian J. Mielke, Ryan Cotterell

Figure 1 for A Structured Variational Autoencoder for Contextual Morphological Inflection

Figure 2 for A Structured Variational Autoencoder for Contextual Morphological Inflection

Figure 3 for A Structured Variational Autoencoder for Contextual Morphological Inflection

Figure 4 for A Structured Variational Autoencoder for Contextual Morphological Inflection

Abstract:Statistical morphological inflectors are typically trained on fully supervised, type-level data. One remaining open research question is the following: How can we effectively exploit raw, token-level data to improve their performance? To this end, we introduce a novel generative latent-variable model for the semi-supervised learning of inflection generation. To enable posterior inference over the latent variables, we derive an efficient variational inference procedure based on the wake-sleep algorithm. We experiment on 23 languages, using the Universal Dependencies corpora in a simulated low-resource setting, and find improvements of over 10% absolute accuracy in some cases.

* Published at ACL 2018

Via

Access Paper or Ask Questions