Caio Corro

EuroBERT: Scaling Multilingual Encoders for European Languages

Mar 07, 2025

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan (+9 more)

Abstract: General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.

* 26 pages, 6 figures, 11 tables

Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

Dec 12, 2024

Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

Abstract: Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.

* COLING 2025 workshop (FinNLP)

Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection

Nov 30, 2024

Ayoub Hammal, Benno Uthayasooriyar, Caio Corro

Abstract: Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER on several English datasets.

* COLING 2025

A fast and sound tagging method for discontinuous named-entity recognition

Sep 24, 2024

Caio Corro

Abstract: We introduce a novel tagging scheme for discontinuous named entity recognition based on an explicit description of the inner structure of discontinuous mentions. We rely on a weighted finite state automaton for both marginal and maximum a posteriori inference. As such, our method is sound in the sense that (1) well-formedness of predicted tag sequences is ensured via the automaton structure and (2) there is an unambiguous mapping between well-formed sequences of tags and (discontinuous) mentions. We evaluate our approach on three English datasets in the biomedical domain, and report comparable results to state-of-the-art while having a way simpler and faster model.

* EMNLP 2024

Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Mar 26, 2024

Santiago Herrera, Caio Corro, Sylvain Kahane

Abstract: Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.

* Published in LREC-COLING 2024 proceedings

SaulLM-7B: A pioneering Large Language Model for Law

Mar 07, 2024

Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado (+1 more)

Abstract: In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

CroissantLLM: A Truly Bilingual French-English Language Model

Feb 02, 2024

Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins (+6 more)

Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

Structural generalization in COGS: Supertagging is (almost) all you need

Oct 21, 2023

Alban Petit, Caio Corro, François Yvon

Abstract: In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based semantic parsing framework in several ways to alleviate this issue. Notably, we propose: (1) the introduction of a supertagging step with valency constraints, expressed as an integer linear program; (2) a reduction of the graph prediction problem to the maximum matching problem; (3) the design of an incremental early-stopping training strategy to prevent overfitting. Experimentally, our approach significantly improves results on examples that require structural generalization in the COGS dataset, a known challenging benchmark for compositional generalization. Overall, our results confirm that structural constraints are important for generalization in semantic parsing.

* Accepted at EMNLP 2023

On graph-based reentrancy-free semantic parsing

Feb 15, 2023

Alban Petit, Caio Corro

Abstract: We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on Geoquery, Scan and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.

* This work has been accepted for publication in TACL. This version is a pre-MIT Press publication version

On the inconsistency of separable losses for structured prediction

Jan 25, 2023

Caio Corro

Abstract: In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.

* Preprint, to appear in proc. of EACL 2023
