Title: Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Dec 20, 2022

Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan

Abstract: We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization). The training dataset has been created automatically from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages, containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and on the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. On the Naamapadam-test set, IndicNER achieves a higher F1 than an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.

* 14 pages, 5 figures, Work in Progress
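
The abstract describes building training data by projecting entity tags from an English sentence onto its Indian-language translation. The abstract does not spell out the procedure, so the following is only a minimal, generic sketch of annotation projection through word alignments, not the authors' implementation. The function name `project_entities`, the BIO tag scheme, the hand-written alignment pairs, and the "cover the aligned span contiguously" heuristic are all assumptions made for illustration; in practice the alignment would come from an automatic word aligner.

```python
from typing import List, Tuple


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    src_tags:  BIO tags for the source (English) tokens.
    alignment: (source_index, target_index) word-alignment pairs (assumed given).
    tgt_len:   number of tokens in the target sentence.
    """
    tgt_tags = ["O"] * tgt_len

    # Collect source entity spans as (start, end_exclusive, label).
    spans = []
    i = 0
    while i < len(src_tags):
        if src_tags[i].startswith("B-"):
            label = src_tags[i][2:]
            j = i + 1
            while j < len(src_tags) and src_tags[j] == "I-" + label:
                j += 1
            spans.append((i, j, label))
            i = j
        else:
            i += 1

    for start, end, label in spans:
        # Target positions aligned to any token of this source entity.
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # entity has no aligned target token; drop it
        lo, hi = tgt_idx[0], tgt_idx[-1]
        # Assumed heuristic: skip projections that would overlap an
        # entity already placed on the target side.
        if any(tgt_tags[k] != "O" for k in range(lo, hi + 1)):
            continue
        tgt_tags[lo] = "B-" + label
        for k in range(lo + 1, hi + 1):
            tgt_tags[k] = "I-" + label
    return tgt_tags


# English: "Sachin lives in Mumbai"  ->  Hindi: "सचिन मुंबई में रहते हैं"
english_tags = ["B-PER", "O", "O", "B-LOC"]
hand_alignment = [(0, 0), (1, 3), (2, 2), (3, 1)]  # illustrative, written by hand
print(project_entities(english_tags, hand_alignment, tgt_len=5))
# prints ['B-PER', 'B-LOC', 'O', 'O', 'O']
```

Labeling the whole range between the first and last aligned target token keeps projected entities contiguous even when the aligner misses a word in the middle, at the cost of occasionally over-extending a span.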
