Christophe Cerisara

SYNALP

The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Mar 15, 2025

Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community

Abstract:We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

Dec 28, 2024

Yaya Sy, Christophe Cerisara, Irina Illina

Abstract:Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first significantly impacts model accuracy. Prior research suggests pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this drop. Based on this observation, we propose Lillama, a compression method that locally distills activations with low-rank weights. Using SVD for initialization and a joint loss combining teacher and student activations, we accelerate convergence and reduce memory use with local gradient updates. Lillama compresses Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of its original performance. Phi-2 3B can be compressed by 40% with just 13 million calibration tokens, resulting in a small model that competes with recent models of similar size. The method generalizes well to non-transformer architectures, compressing Mamba-3B by 20% while maintaining 99% performance.

* 20 pages, 8 figures
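
The parameter savings behind low-rank compression can be illustrated with simple arithmetic: replacing a dense m x n weight matrix by two factors of rank r stores r * (m + n) values instead of m * n. The sketch below is only an illustration of that counting argument. The layer size (4096 x 4096) and the helper names are hypothetical, and the 0.6 keep ratio merely echoes the 40% compression figure quoted in the abstract; the abstract does not state which ranks the authors actually use per layer.

```python
import math


def dense_params(m: int, n: int) -> int:
    """Number of weights in a dense m x n matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of weights when W (m x n) is replaced by A (m x r) @ B (r x n)."""
    return r * (m + n)


def max_rank_for_ratio(m: int, n: int, keep_ratio: float) -> int:
    """Largest rank r whose factorization keeps at most keep_ratio of the dense weights."""
    return math.floor(keep_ratio * m * n / (m + n))


# Hypothetical layer size, chosen only for illustration.
m, n = 4096, 4096
r = max_rank_for_ratio(m, n, keep_ratio=0.6)
kept = low_rank_params(m, n, r) / dense_params(m, n)
print(f"rank={r}, fraction of weights kept={kept:.4f}")
```

Note that the factorization only saves parameters when r is below m * n / (m + n), which is 2048 for the square layer assumed here.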

Large Language Models Compression via Low-Rank Feature Distillation

Dec 21, 2024

Yaya Sy, Christophe Cerisara, Irina Illina

Abstract:Current LLM structured pruning methods involve two steps: (1) compressing with calibration data and (2) continued pretraining on billions of tokens to recover the lost performance. This costly second step is needed as the first step significantly impacts performance. Previous studies have found that pretrained Transformer weights aren't inherently low-rank, unlike their activations, which may explain this performance drop. Based on this observation, we introduce a one-shot compression method that locally distills low-rank weights. We accelerate convergence by initializing the low-rank weights with SVD and using a joint loss that combines teacher and student activations. We reduce memory requirements by applying local gradient updates only. Our approach can compress Mixtral-8x7B within minutes on a single A100 GPU, removing 10 billion parameters while maintaining over 95% of the original performance. Phi-2 3B can be compressed by 40% using only 13 million calibration tokens into a small model that competes with recent models of similar size. We show our method generalizes well to non-transformer architectures: Mamba-3B can be compressed by 20% while maintaining 99% of its performance.

* 20 pages, 8 figures

Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning

Aug 30, 2024

Maxime Méloux, Christophe Cerisara

Abstract:Teaching new information to pre-trained large language models (PLMs) is a crucial but challenging task. Model adaptation techniques, such as fine-tuning and parameter-efficient training, have been shown to store new facts at a slow rate; continual learning is an option but is costly and prone to catastrophic forgetting. This work studies and quantifies how PLMs may learn and remember new world knowledge facts that do not occur in their pre-training corpus, which only contains world knowledge up to a certain date. To that end, we first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates, along with two evaluation tasks in the form of causal language modeling and multiple choice questions (MCQ). We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information. We also explore the use of prefix-tuning for novel information learning, and analyze how much information can be stored within a given prefix. We show that a single fact can reliably be encoded within a single prefix, and that the prefix capacity increases with its length and with the base model size.

A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

Jun 17, 2024

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract:The zero-shot and few-shot performance of Large Language Models (LLMs) is subject to memorization and data contamination, complicating the assessment of their validity. In literary tasks, the performance of LLMs is often correlated with the degree of book memorization. In this work, we carry out a realistic evaluation of LLMs for quotation attribution in novels, taking the instruction fine-tuned version of Llama3 as an example. We design a task-specific memorization measure and use it to show that Llama3's ability to perform quotation attribution is positively correlated with the novel's degree of memorization. However, Llama3 still performs impressively well on books it has neither memorized nor seen. Data and code will be made publicly available.

* Paper under review

Improving Quotation Attribution with Fictional Character Embeddings

Jun 17, 2024

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract:Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack character representations, which often leads to errors on more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global information about characters. To build these embeddings, we create DramaCV, a corpus of English drama plays from the 15th to the 20th century focused on Character Verification (CV), a task similar to Authorship Verification (AV) that aims at analyzing fictional characters. We train a model similar to the recently proposed AV model, Universal Authorship Representation (UAR), on this dataset, showing that it outperforms concurrent character embedding methods on the CV task and generalizes better to literary novels. Then, through an extensive evaluation on 22 novels, we show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. Code and data will be made publicly available.

* Paper under review

Distinguishing Fictional Voices: a Study of Authorship Verification Models for Quotation Attribution

Jan 30, 2024

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract:Recent approaches to automatically detect the speaker of an utterance of direct speech often disregard general information about characters in favor of local information found in the context, such as surrounding mentions of entities. In this work, we explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models in a large corpus of English novels (the Project Dialogism Novel Corpus). Results suggest that the combination of stylistic and topical information captured in some of these models accurately distinguishes characters from one another, but does not necessarily improve over semantic-only models when attributing quotes. However, these results vary across novels, and further investigation of stylometric models specifically tailored for literary texts and the study of characters should be conducted.

* Accepted at EACL 2024's workshop LaTeCH-CLfL

Learning representations with end-to-end models for improved remaining useful life prognostics

Apr 11, 2021

Alaaeddine Chaoub, Alexandre Voisin, Christophe Cerisara, Benoît Iung

Abstract:The Remaining Useful Life (RUL) of equipment is defined as the duration between the current time and its failure. An accurate and reliable prognostic of the remaining useful life provides decision-makers with valuable information to adopt an appropriate maintenance strategy to maximize equipment utilization and avoid costly breakdowns. In this work, we propose an end-to-end deep learning model based on multi-layer perceptron (MLP) and long short-term memory (LSTM) layers to predict the RUL. After normalization of all data, inputs are fed directly to MLP layers for feature learning, then to an LSTM layer to capture temporal dependencies, and finally to other MLP layers for RUL prognostic. The proposed architecture is tested on the NASA commercial modular aero-propulsion system simulation (C-MAPSS) dataset. Despite its simplicity with respect to other recently proposed models, the model developed outperforms them with a significant decrease in the competition score and in the root mean square error score between the predicted and the gold value of the RUL. In this paper, we will discuss how the proposed end-to-end model is able to achieve such good results and compare it to other deep learning and state-of-the-art methods.
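
The two evaluation metrics named in the abstract can be sketched as follows. The abstract does not define the competition score, so this sketch assumes the asymmetric scoring function commonly used with C-MAPSS (from the PHM08 challenge), in which late predictions are penalized more heavily than early ones through the constants 13 and 10. The function names are hypothetical and the code is not taken from the paper.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between predicted and gold RUL values."""
    squared = [(p - t) ** 2 for p, t in zip(pred, true)]
    return math.sqrt(sum(squared) / len(squared))


def competition_score(pred: Sequence[float], true: Sequence[float]) -> float:
    """Asymmetric score (assumed PHM08 form): lower is better.

    d = predicted - true. Early predictions (d < 0) use exp(-d / 13) - 1,
    late predictions (d >= 0) use exp(d / 10) - 1.
    """
    total = 0.0
    for p, t in zip(pred, true):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total


# Toy values for illustration only, not results from the paper.
gold = [20.0, 50.0, 80.0]
early = [10.0, 40.0, 70.0]  # each prediction 10 cycles too early
late = [30.0, 60.0, 90.0]   # each prediction 10 cycles too late
print(rmse(early, gold), competition_score(early, gold))
print(rmse(late, gold), competition_score(late, gold))
```

Under this assumed definition, the two toy prediction sets share the same RMSE but the late one receives the larger score, which is why both metrics are usually reported together.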

On the Effects of Using word2vec Representations in Neural Networks for Dialogue Act Recognition

Oct 22, 2020

Christophe Cerisara, Pavel Kral, Ladislav Lenc

Abstract:Dialogue act recognition is an important component of a large number of natural language processing pipelines. Many research works have been carried out in this area, but relatively few investigate deep neural networks and word embeddings. This is surprising, given that both of these techniques have proven exceptionally good in most other language-related domains. We propose in this work a new deep neural network that explores recurrent models to capture word sequences within sentences, and further study the impact of pretrained word embeddings. We validate this model on three languages: English, French and Czech. The performance of the proposed approach is consistent across these languages and it is comparable to the state-of-the-art results in English. More importantly, we confirm that deep neural networks indeed outperform a Maximum Entropy classifier, which was expected. However, and this is more surprising, we also found that standard word2vec embeddings do not seem to bring valuable information for this task and the proposed model, whatever the size of the training corpus. We thus further analyse the resulting embeddings and conclude that a possible explanation may be related to the mismatch between the type of lexical-semantic information captured by the word2vec embeddings and the kind of relations between words that is the most useful for the dialogue act recognition task.

* Computer Speech and Language, Elsevier, 2018, 47, pp.175 - 193

Cross-lingual Transfer Learning for Dialogue Act Recognition

May 19, 2020

Jiří Martínek, Christophe Cerisara, Pavel Král, Ladislav Lenc

Abstract:This paper deals with cross-lingual transfer learning for dialogue act (DA) recognition. Besides generic contextual information gathered from pre-trained BERT embeddings, our objective is to transfer models trained on a standard English DA corpus to two other languages, German and French, and to potentially very different types of dialogue with different dialogue acts than the standard well-known DA corpora. The proposed approach thus studies the applicability of automatic DA recognition to specific tasks that may not benefit from a large enough number of manual annotations. A key component of our architecture is the automatic translation module, whose limitations are addressed by stacking both foreign and translated word sequences into the same model. We further compare both CNN and multi-head self-attention to compute the speaker turn embeddings and show that in low-resource situations, the best results are obtained by combining all sources of transferred information.

* Submitted for Interspeech 2020
