Title: EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Sep 26, 2024

Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann (+1 more)

Abstract: In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts across 546 languages and designed for enhanced multilingual performance, with a focus on improving coverage of low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding the language capacity of large language models, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.
