José Cañete

Spanish Pre-trained BERT Model and Evaluation Data

Aug 06, 2023

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez

Abstract: The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

* Published as workshop paper at Practical ML for Developing Countries Workshop @ ICLR 2020

ALBETO and DistilBETO: Lightweight Spanish Language Models

Apr 19, 2022

José Cañete, Sebastián Donoso, Felipe Bravo-Marquez, Andrés Carvallo, Vladimir Araujo

Abstract: In recent years there have been considerable advances in pre-trained language models, where non-English language versions have also been made available. Due to their increasing use, many lightweight versions of these models (with reduced parameters) have also been released to speed up training and inference times. However, versions of these lighter models (e.g., ALBERT, DistilBERT) for languages other than English are still scarce. In this paper we present ALBETO and DistilBETO, which are versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora. We train several versions of ALBETO ranging from 5M to 223M parameters and one of DistilBETO with 67M parameters. We evaluate our models in the GLUES benchmark that includes various natural language understanding tasks in Spanish. The results show that our lightweight models achieve competitive results to those of BETO (Spanish-BERT) despite having fewer parameters. More specifically, our larger ALBETO model outperforms all other models on the MLDoc, PAWS-X, XNLI, MLQA, SQAC and XQuAD datasets. However, BETO remains unbeaten for POS and NER. As a further contribution, all models are publicly available to the community for future research.

* Accepted paper at LREC2022

Evaluation Benchmarks for Spanish Sentence Representations

Apr 15, 2022

Vladimir Araujo, Andrés Carvallo, Souvik Kundu, José Cañete, Marcelo Mendoza, Robert E. Mercer, Felipe Bravo-Marquez, Marie-Francine Moens, Alvaro Soto

Abstract: Due to the success of pre-trained language models, versions of languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include considerable pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for the case of discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only with documents in Spanish. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.

* Accepted paper at LREC2022
