Dan Wallach

Language Without Words: A Pointillist Model for Natural Language Processing

Dec 11, 2012

Peiyou Song, Anhei Shu, David Phipps, Dan Wallach, Mohit Tiwari, Jedidiah Crandall, George Luger

Abstract: This paper explores two separate questions: Can we perform natural language processing tasks without a lexicon? And should we? Existing natural language processing techniques either treat words as the basic unit or use units such as grams only for basic classification tasks. How close can a machine come to reasoning about the meanings of words and phrases in a corpus without using any lexicon, based only on grams? Our motivation for posing this question comes from our efforts to find popular trends in words and phrases from online Chinese social media. This form of written Chinese uses so many neologisms, creative character placements, and combinations of writing systems that it has been dubbed the "Martian Language." Readers must often use visual cues, audible cues from reading out loud, and their knowledge and understanding of current events to understand a post. For analysis of popular trends, the specific problem is that it is difficult to build a lexicon when inventing new ways to refer to a word or concept is easy and common. For natural language processing in general, we argue in this paper that new uses of language in social media will challenge machines' abilities to operate with words as the basic unit of understanding, not only in Chinese but potentially in other languages.

* The 6th International Conference on Soft Computing and Intelligent Systems (SCIS-ISIS 2012), Kobe, Japan
* 5 pages, 2 figures

A Pointillism Approach for Natural Language Processing of Social Media

Jun 21, 2012

Peiyou Song, Anhei Shu, Anyu Zhou, Dan Wallach, Jedidiah R. Crandall

Abstract: The Chinese language poses challenges for natural language processing based on the unit of a word, even for formal uses of the language, and social media makes word segmentation in Chinese even more difficult. In this document we propose a pointillism approach to natural language processing. Rather than words that have individual meanings, the basic unit of a pointillism approach is the character trigram. These grams take on meaning in aggregate when they appear together in a way that is correlated over time. Our results from three kinds of experiments show that when words and topics do have a meme-like trend, they can be reconstructed from only trigrams. For example, for 4-character idioms that appear at least 99 times in one day in our data, the unconstrained precision (that is, precision that allows for deviation from a lexicon when the result is just as correct as the lexicon version of the word or phrase) is 0.93. For longer words and phrases collected from Wiktionary, including neologisms, the unconstrained precision is 0.87. We consider these results to be very promising, because they suggest that it is feasible for a machine to reconstruct complex idioms, phrases, and neologisms with good precision without any notion of words. Thus the colorful and baroque uses of language that typify social media in challenging languages such as Chinese may in fact be accessible to machines.

* 8 pages, 5 figures
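
The idea in the abstract above, character trigrams that take on meaning when their frequencies move together over time, can be illustrated with a toy sketch. This is not the authors' algorithm: the function names, the use of Pearson correlation on daily counts, the greedy rightward chaining, the 0.9 threshold, and the four-day example corpus are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_trigrams(text):
    """Return every overlapping 3-character window in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def daily_counts(posts_by_day):
    """Map each trigram to a list of per-day counts (one entry per day)."""
    days = len(posts_by_day)
    series = {}
    for day, posts in enumerate(posts_by_day):
        counts = Counter(g for post in posts for g in char_trigrams(post))
        for gram, n in counts.items():
            series.setdefault(gram, [0] * days)[day] = n
    return series


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def extend_right(seed, series, threshold=0.9):
    """Greedily grow seed rightward by chaining overlapping, correlated trigrams.

    A trigram g may follow the current one when its first two characters
    equal the current trigram's last two and their daily counts correlate
    at or above the (assumed) threshold.
    """
    phrase, current = seed, seed
    seen = {seed}
    while True:
        candidates = [
            (pearson(series[current], series[g]), g)
            for g in series
            if g[:2] == current[1:] and g not in seen
        ]
        if not candidates:
            return phrase
        score, best = max(candidates)
        if score < threshold:
            return phrase
        phrase += best[2]
        seen.add(best)
        current = best


# Hypothetical four-day corpus in which one idiom trends on days 1 and 2.
posts_by_day = [
    ["今天天气很好"],
    ["守株待兔不可取", "不要守株待兔", "他在守株待兔"],
    ["守株待兔", "又见守株待兔"],
    ["今天天气很好"],
]
series = daily_counts(posts_by_day)
print(extend_right("守株待", series))
```

No lexicon or word segmenter is consulted at any point: the phrase boundary falls wherever the next overlapping trigram stops tracking the seed's daily trend. A fuller treatment would also extend leftward and would need far more data than this toy corpus for the correlations to be meaningful.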