Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While there are some publicly available annotated datasets of tweets, each is purpose-built for a single task. To date, there is no training corpus that supports both syntactic analysis (e.g., part-of-speech tagging, dependency parsing) and NER of tweets. In this study, we create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and use it to train state-of-the-art NLP models for tweets. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We then train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Twitter NLP models (a tokenizer, lemmatizer, part-of-speech tagger, and dependency parser) with Stanza on TB2, achieving state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available for "off-the-shelf" use in future Tweet NLP research. Our source code, data, and pre-trained models are available at \url{https://github.com/social-machines/TweebankNLP}.
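To illustrate the intended "off-the-shelf" usage, the sketch below loads Tweebank-trained Stanza processors into a single pipeline. This is a minimal sketch, not code from the paper: the \texttt{*\_model\_path} values are hypothetical local paths, and the repository above should be consulted for the actual released model files and download instructions.

\begin{verbatim}
# Minimal usage sketch (assumption, not the paper's code): load the
# Tweebank-trained Stanza models as one off-the-shelf pipeline.
# All model paths below are hypothetical placeholders.
import stanza

nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma,pos,depparse,ner",
    tokenize_model_path="twitter-stanza/en_tweet_tokenizer.pt",  # assumed
    lemma_model_path="twitter-stanza/en_tweet_lemmatizer.pt",    # assumed
    pos_model_path="twitter-stanza/en_tweet_tagger.pt",          # assumed
    depparse_model_path="twitter-stanza/en_tweet_parser.pt",     # assumed
    ner_model_path="twitter-stanza/en_tweet_ner.pt",             # assumed
)

doc = nlp("omg @NASA just landed in Houston!!")
for ent in doc.ents:
    # Named entities found by the NER processor, with their types.
    print(ent.text, ent.type)
for word in doc.sentences[0].words:
    # Per-token POS tags and dependency arcs from the tagger and parser.
    print(word.text, word.upos, word.head, word.deprel)
\end{verbatim}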