Title: MC^2: A Multilingual Corpus of Minority Languages in China

Nov 14, 2023

Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng

Abstract: Large-scale corpora play a vital role in the construction of large language models (LLMs). However, existing LLMs exhibit limited abilities in understanding low-resource languages, including the minority languages in China, due to a lack of training data. To improve the accessibility of these languages, we present MC^2, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of these languages to date. It encompasses four underrepresented languages: Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script. Notably, two of the writing systems in MC^2 have long been neglected in previous corpora. Because we identify serious contamination in the low-resource language splits of existing multilingual corpora, we propose a quality-centric solution for collecting MC^2, prioritizing quality and accuracy while enhancing representativeness and diversity. Through in-depth analysis, we demonstrate the new research challenges MC^2 brings, such as long-text modeling and the multiplicity of writing systems. We hope MC^2 can help enhance the equity of the underrepresented languages in China and provide a reliable data foundation for further research on low-resource languages.

* Work in progress
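The abstract mentions contamination in low-resource language splits and the multiplicity of writing systems. As a rough illustration only, and not the authors' collection pipeline, the sketch below shows one simple kind of check: measuring what fraction of a text's letters fall in the Unicode ranges of the expected script. The function names, the range table, and the 0.5 threshold are all assumptions made for this example. The check cannot separate languages that share a script (for instance Uyghur and Kazakh, both written here in Arabic script), so it could at most serve as a first-pass filter.

```python
# Hypothetical sketch: names, ranges, and threshold are illustrative assumptions,
# not taken from the MC^2 paper.

# Unicode code point ranges for the scripts named in the abstract.
SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "mongolian": [(0x1800, 0x18AF)],  # traditional Mongolian script
    "arabic": [                       # used for Uyghur and Kazakh Arabic script
        (0x0600, 0x06FF),
        (0x0750, 0x077F),
        (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
}


def script_ratio(text: str, script: str) -> float:
    """Return the fraction of alphabetic characters that belong to `script`."""
    ranges = SCRIPT_RANGES[script]
    # Only letters are counted; digits, punctuation, and combining marks are ignored.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters if any(lo <= ord(ch) <= hi for lo, hi in ranges)
    )
    return hits / len(letters)


def looks_contaminated(text: str, script: str, min_ratio: float = 0.5) -> bool:
    """Flag a text whose share of letters in the expected script is below min_ratio."""
    return script_ratio(text, script) < min_ratio
```

For example, `looks_contaminated(doc, "mongolian")` would flag a document written in Cyrillic Mongolian that had been placed in a traditional-script Mongolian split, since none of its letters fall in the Mongolian block.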
