Title: SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

Dec 02, 2024

Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu

Abstract: In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian (SEA) languages. SailCompass covers three main SEA languages and eight primary tasks, comprising 14 datasets across three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibration to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) a balanced language distribution is important for developing better SEA-specialized LLMs; (3) advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.
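The abstract mentions perplexity-based ranking for multiple-choice questions without giving details. As a rough illustration of the general idea only, and not the SailCompass implementation, the sketch below scores each answer option by the perplexity of its tokens and picks the lowest. The function names and the log-probability values are hypothetical; in practice the per-token log-probabilities would come from the LLM being evaluated.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # exp of the negative mean log-probability (length-normalized).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_options(option_logprobs: Dict[str, List[float]]) -> List[str]:
    """Return option labels sorted from lowest to highest perplexity."""
    return sorted(option_logprobs, key=lambda label: perplexity(option_logprobs[label]))


# Hypothetical per-token log-probabilities for three answer options
# (assumed values for illustration, not taken from the paper).
example = {
    "A": [-2.1, -1.9, -2.4],
    "B": [-0.4, -0.6, -0.5, -0.7],
    "C": [-1.2, -1.0],
}

ranking = rank_options(example)
prediction = ranking[0]  # the option the model finds least surprising
print(ranking, prediction)
```

Normalizing by token count keeps longer options from being penalized simply for having more tokens, which is one common reason to rank by perplexity rather than by total log-probability.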

* code: https://github.com/sail-sg/sailcompass
