Ricardo Lopes

GlórIA -- A Generative and Open Large Language Model for Portuguese

Feb 20, 2024

Ricardo Lopes, João Magalhães, David Semedo

Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.

* Accepted for publication at PROPOR 2024
