Abstract: We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple-choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models and our new pretraining corpora and evaluation datasets are publicly available under open licenses at https://github.com/hitz-zentroa/latxa. Our suite enables reproducible research on methods to build LLMs for low-resource languages.
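As a quick orientation for readers, a minimal sketch of loading a released Latxa checkpoint with the Hugging Face transformers library is shown below; the Hub identifier "HiTZ/latxa-7b-v1" and the example prompt are assumptions based on the public repository, not names confirmed by the abstract.

    # Minimal sketch: load an assumed Latxa checkpoint and generate text.
    # "HiTZ/latxa-7b-v1" is an assumed Hub identifier and may differ from
    # the published model names.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "HiTZ/latxa-7b-v1"  # assumed model name
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "Euskal Herriko hiriburua"  # Basque: "The capital of the Basque Country"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))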
Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has much higher quality according to native annotators. For instance, 66% of EusCrawl documents are rated as high quality, in contrast with less than 33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, calling for methods to exploit more diverse data sources.
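To make the tailored-crawling idea concrete, the sketch below fetches text from a hand-curated seed list instead of filtering CommonCrawl; the seed URL and the extraction heuristic are placeholders, not the actual EusCrawl sources or pipeline.

    # Minimal sketch of tailored crawling: scrape a curated list of sites
    # known to host high-quality text, rather than filter CommonCrawl.
    import requests
    from bs4 import BeautifulSoup

    SEED_SITES = ["https://example-basque-news.eus"]  # hypothetical seeds

    def extract_text(url: str) -> str:
        """Download a page and keep only its visible paragraph text."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

    corpus = [extract_text(url) for url in SEED_SITES]

The value of this approach lies in the manual step itself: sites are identified as hosting high-quality content before any scraping happens, rather than filtered afterwards.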
Abstract: In this paper we present an approach to extracting ordered timelines of events, their participants, locations and times from a set of multilingual and cross-lingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering task presented at SemEval 2015 by specifying two new tasks for Multilingual and Cross-lingual Timeline Extraction, respectively. We then develop three deterministic algorithms for timeline extraction based on two main ideas. First, we address implicit temporal relations at the document level, since explicit time anchors are too scarce to build a wide-coverage timeline extraction system. Second, we leverage several multilingual resources to obtain a single, interoperable, semantic representation of events across documents and across languages. The result is a highly competitive system that strongly outperforms the current state of the art. Nonetheless, further analysis of the results reveals that linking event mentions to their target entities and time anchors remains a difficult challenge. The systems, resources and scorers are freely available to facilitate their use and guarantee the reproducibility of our results.
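The core deterministic recipe (merge event mentions from different documents and languages through a shared semantic representation, then order the merged events by their time anchors) can be sketched as follows; the EventMention fields and string keys are illustrative assumptions, not the paper's actual interlingual representation.

    # Minimal sketch of deterministic timeline construction: mentions that
    # share a cross-lingual event key are merged, then ordered in time.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class EventMention:
        key: str       # assumed cross-lingual representation of the event
        anchor: date   # resolved time anchor for this mention
        doc_id: str    # source document

    def build_timeline(mentions):
        """Merge coreferent mentions, then return events in temporal order."""
        merged = {}
        for m in mentions:
            # keep the earliest anchor seen for each cross-document cluster
            if m.key not in merged or m.anchor < merged[m.key].anchor:
                merged[m.key] = m
        return sorted(merged.values(), key=lambda m: m.anchor)

As the abstract notes, the hard part lies upstream of this loop: linking each mention to the right entities and time anchor in the first place.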
Abstract: This paper describes a hierarchical system that predicts one label at a time for automated student response analysis. For this task, we build a binary classification tree that delays more easily confused labels to later stages of the hierarchy. In particular, the paper describes how the hierarchical classifier is built and how the classification task is broken down into binary subtasks. Finally, it discusses the motivation and fundamentals of this approach.
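A minimal sketch of the one-label-at-a-time idea follows, assuming a chain of binary decisions in which the most separable label is predicted first and more easily confused labels are deferred; the label set, ordering, and features are hypothetical, not the system's actual configuration.

    # Minimal sketch: a chain of binary classifiers that decides one label
    # per stage, deferring harder (more confusable) labels to later stages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    STAGES = ["correct", "contradictory", "irrelevant"]  # assumed ordering

    def train_stage(texts, labels, target):
        """Binary classifier: is this response `target`, or something else?"""
        y = [int(label == target) for label in labels]
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, y)
        return clf

    def train_chain(texts, labels):
        """One binary stage per non-terminal label, in the assumed order."""
        return [(t, train_stage(texts, labels, t)) for t in STAGES[:-1]]

    def predict(chain, text):
        for target, clf in chain:
            if clf.predict([text])[0] == 1:
                return target
        return STAGES[-1]  # fall through to the last, hardest label

Deferring the most confusable labels means each early stage faces a comparatively easy binary decision, which mirrors the motivation described above.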