Title: Modeling the Unigram Distribution

Jun 04, 2021

Irene Nikkarinen, Tiago Pimentel, Damián E. Blasi, Ryan Cotterell

Abstract: The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (OOV) word form. As a result, it produces negatively biased probabilities for any OOV word form and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level language models.
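
To make the zero-probability problem concrete, here is a minimal sketch of the sample-frequency estimator the abstract criticizes. The toy corpus and the names `sample_frequency`, `train`, and `held_out` are assumptions made for illustration; this is not the authors' model and is not taken from their repository.

```python
from collections import Counter


def sample_frequency(tokens):
    """Estimate unigram probabilities as relative frequencies in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy data, chosen only to show the effect.
train = "the cat sat on the mat".split()
held_out = "the dog sat".split()

p = sample_frequency(train)

# Word forms never seen in the training sample receive probability zero.
oov = [word for word in held_out if word not in p]

for word in held_out:
    print(word, p.get(word, 0.0))
print("OOV word forms:", oov)
```

Because the estimates over in-corpus words already sum to one, no probability mass is left for unseen word forms. That is the sense in which in-corpus words are overestimated and OOV words are underestimated.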

* Irene Nikkarinen and Tiago Pimentel contributed equally to this work. Accepted to the Findings of ACL 2021. Code available at https://github.com/irenenikk/modelling-unigram
