Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Itziar Gonzalez-Dios

Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Jun 18, 2025

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

Abstract:In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.

Via

Access Paper or Ask Questions

LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations

Jun 06, 2025

Belén Agüera-Marco, Itziar Gonzalez-Dios

Abstract:In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.

* In this report, we present a part of the master thesis written by Bel\'en Ag\"uera Marco in order to obtain the B.S. Language Analysis and Processing at the University of the Basque Country (UPV/EHU), supervised by Itziar Gonzalez-Dios

Via

Access Paper or Ask Questions

A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models

May 17, 2024

Francesca De Luca Fornaciari, Begoña Altuna, Itziar Gonzalez-Dios, Maite Melero

Abstract:In this work, we explore idiomatic language processing with Large Language Models (LLMs). We introduce the Idiomatic language Test Suite IdioTS, a new dataset of difficult examples specifically designed by language experts to assess the capabilities of LLMs to process figurative language at sentence level. We propose a comprehensive evaluation methodology based on an idiom detection task, where LLMs are prompted with detecting an idiomatic expression in a given English sentence. We present a thorough automatic and manual evaluation of the results and an extensive error analysis.

Via

Access Paper or Ask Questions

This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

Oct 24, 2023

Iker García-Ferrero, Begoña Altuna, Javier Álvez, Itziar Gonzalez-Dios, German Rigau

Abstract:Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs understanding negation. We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false in which negation is present in about 2/3 of the corpus in different forms. We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation is persistent, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.

* Accepted in the The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

Via

Access Paper or Ask Questions

Easy-to-Read in Germany: A Survey on its Current State and Available Resources

Jun 05, 2023

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Abstract:Easy-to-Read Language (E2R) is a controlled language variant that makes any written text more accessible through the use of clear, direct and simple language. It is mainly aimed at people with cognitive or intellectual disabilities, among other target users. Plain Language (PL), on the other hand, is a variant of a given language, which aims to promote the use of simple language to communicate information. German counts with Leichte Sprache (LS), its version of E2R, and Einfache Sprache (ES), its version of PL. In recent years, important developments have been conducted in the field of LS. This paper offers an updated overview of the existing Natural Language Processing (NLP) tools and resources for LS. Besides, it also aims to set out the situation with regard to LS and ES in Germany.

* 10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 2023

Via

Access Paper or Ask Questions

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Mar 07, 2023

Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen(+44 more)

Figure 1 for The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Figure 2 for The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Figure 3 for The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Figure 4 for The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Abstract:As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

* NeurIPS 2022, Datasets and Benchmarks Track

Via

Access Paper or Ask Questions

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé(+380 more)

Abstract:Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Via

Access Paper or Ask Questions

Noisy Channel for Automatic Text Simplification

Nov 06, 2022

Oscar M Cumbicus-Pineda, Iker Gutiérrez-Fandiño, Itziar Gonzalez-Dios, Aitor Soroa

Abstract:In this paper we present a simple re-ranking method for Automatic Sentence Simplification based on the noisy channel scheme. Instead of directly computing the best simplification given a complex text, the re-ranking method also considers the probability of the simple sentence to produce the complex counterpart, as well as the probability of the simple text itself, according to a language model. Our experiments show that combining these scores outperform the original system in three different English datasets, yielding the best known result in one of them. Adopting the noisy channel scheme opens new ways to infuse additional information into ATS systems, and thus to control important aspects of them, a known limitation of end-to-end neural seq2seq generative models.

* 8 pages

Via

Access Paper or Ask Questions

Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

May 03, 2022

Oscar Sainz, Itziar Gonzalez-Dios, Oier Lopez de Lacalle, Bonan Min, Eneko Agirre

Figure 1 for Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

Figure 2 for Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

Figure 3 for Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

Figure 4 for Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

Abstract:Recent work has shown that NLP tasks such as Relation Extraction (RE) can be recasted as Textual Entailment tasks using verbalizations, with strong performance in zero-shot and few-shot settings thanks to pre-trained entailment models. The fact that relations in current RE datasets are easily verbalized casts doubts on whether entailment would be effective in more complex tasks. In this work we show that entailment is also effective in Event Argument Extraction (EAE), reducing the need of manual annotation to 50% and 20% in ACE and WikiEvents respectively, while achieving the same performance as with full training. More importantly, we show that recasting EAE as entailment alleviates the dependency on schemas, which has been a road-block for transferring annotations between domains. Thanks to the entailment, the multi-source transfer between ACE and WikiEvents further reduces annotation down to 10% and 5% (respectively) of the full training without transfer. Our analysis shows that the key to good results is the use of several entailment datasets to pre-train the entailment model. Similar to previous approaches, our method requires a small amount of effort for manual verbalization: only less than 15 minutes per event argument type is needed, and comparable results can be achieved with users with different level of expertise.

* Accepted as Findings of NAACL2022

Via

Access Paper or Ask Questions

MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment

Sep 10, 2021

Kepa Bengoetxea, Itziar Gonzalez-Dios

Figure 1 for MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment

Figure 2 for MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment

Figure 3 for MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment

Figure 4 for MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment

Abstract:Readability assessment is the task of determining how difficult or easy a text is or which level/grade it has. Traditionally, language dependent readability formula have been used, but these formulae take few text characteristics into account. However, Natural Language Processing (NLP) tools that assess the complexity of texts are able to measure more different features and can be adapted to different languages. In this paper, we present the MultiAzterTest tool: (i) an open source NLP tool which analyzes texts on over 125 measures of cohesion,language, and readability for English, Spanish and Basque, but whose architecture is designed to easily adapt other languages; (ii) readability assessment classifiers that improve the performance of Coh-Metrix in English, Coh-Metrix-Esp in Spanish and ErreXail in Basque; iii) a web tool. MultiAzterTest obtains 90.09 % in accuracy when classifying into three reading levels (elementary, intermediate, and advanced) in English and 95.50 % in Basque and 90 % in Spanish when classifying into two reading levels (simple and complex) using a SMO classifier. Using cross-lingual features, MultiAzterTest also obtains competitive results above all in a complex vs simple distinction.

* 33 pages

Via

Access Paper or Ask Questions