Abstract:Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks with unprecedented accuracy and without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing such models to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model for a given language task without compromising its overall capabilities, using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.
Abstract:Legal texts, characterized by complex and specialized terminology, present a significant challenge for language models. Adding an underrepresented language, such as Spanish, to the mix makes it even more challenging. While pre-trained models like XLM-RoBERTa have shown capabilities in handling multilingual corpora, their performance on domain-specific documents remains underexplored. This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large, fine-tuned on legal documents such as the BOE (Boletín Oficial del Estado, the official gazette of the Spanish state) and congressional texts. We detail the data collection, processing, training, and evaluation processes. Evaluation benchmarks show a significant improvement over baseline models in understanding legal Spanish. We also present case studies demonstrating the model's application to new legal texts, highlighting its potential to achieve top results across different NLP tasks.
Abstract:Legal corpora for Natural Language Processing (NLP) are valuable and scarce resources in languages like Spanish for two main reasons: data accessibility and the availability of legal expert knowledge. INESData 2024 is a European Union-funded project led by the Universidad Politécnica de Madrid (UPM) and developed by the Instituto de Ingeniería del Conocimiento (IIC) to create a series of state-of-the-art NLP resources applied to the legal/administrative domain in Spanish. The goal of this paper is to present the Corpus of Legal Spanish Contract Clauses (3CEL), a contract information extraction corpus developed within the framework of INESData 2024. 3CEL contains 373 manually annotated tenders labelled with 19 defined categories (4,782 total tags) that identify key information for contract understanding and review.