Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gokul N. C.

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Apr 30, 2020

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

Figure 1 for AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Figure 2 for AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Figure 3 for AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Figure 4 for AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Abstract:We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embedding on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.

* 7 pages, 8 tables, https://github.com/ai4bharat-indicnlp/indicnlp_corpus

Via

Access Paper or Ask Questions