Beatrice Daille

Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Nov 05, 2024

Mael Houbre, Florian Boudin, Beatrice Daille, Akiko Aizawa

Abstract: State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement in terms of their representativeness.

* Accepted to JCDL 2024. This version is not the final camera-ready version.
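
The abstract above describes the method only at a high level: relate training documents through their shared keyphrases, then combine similar documents into synthetic samples. The following is a minimal sketch of that idea, not the authors' implementation. The Jaccard overlap measure, the `threshold` value, the plain concatenation of texts, and the names `jaccard` and `augment` are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two keyphrase collections (assumed relatedness measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def augment(docs, threshold=0.3):
    """Pair documents whose keyphrase overlap reaches `threshold` and
    concatenate each pair into one synthetic training sample."""
    synthetic = []
    for d1, d2 in combinations(docs, 2):
        if jaccard(d1["keyphrases"], d2["keyphrases"]) >= threshold:
            # Merge keyphrases, dropping duplicates while keeping first-seen order.
            merged = list(dict.fromkeys(d1["keyphrases"] + d2["keyphrases"]))
            synthetic.append({
                "text": d1["text"] + " " + d2["text"],
                "keyphrases": merged,
            })
    return synthetic


# Toy, made-up documents used only to exercise the sketch.
docs = [
    {"text": "Doc A on neural keyphrase generation.",
     "keyphrases": ["keyphrase generation", "neural networks"]},
    {"text": "Doc B on keyphrase generation with data augmentation.",
     "keyphrases": ["keyphrase generation", "data augmentation"]},
    {"text": "Doc C on protein folding.",
     "keyphrases": ["protein folding"]},
]
samples = augment(docs)
```

Here only documents A and B share a keyphrase, so they are the only pair combined. No external data is used, which matches the property the abstract emphasizes.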

Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study

Feb 26, 2024

Adrien Bazoge, Emmanuel Morin, Beatrice Daille, Pierre-Antoine Gourraud

Abstract: Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks spanning both biomedical and clinical domains. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform both converting a French biomedical BERT to the Longformer architecture and pre-training a French biomedical Longformer from scratch. The results underscore that long-sequence French biomedical models improve performance across most downstream tasks regardless of sequence length, but BERT-based models remain the most efficient for named entity recognition tasks.

How Important Is Tokenization in French Medical Masked Language Models?

Feb 22, 2024

Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour

Abstract: Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.

* Accepted at LREC-Coling 2024
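
The abstract does not specify how morpheme-enriched segmentation is performed, so the following is only a toy sketch of the general idea of splitting a term at known morpheme boundaries before a subword tokenizer sees it. The greedy longest-match strategy, the function name `segment`, and the tiny `MORPHEMES` list are hypothetical and chosen purely for illustration.

```python
# Hypothetical mini-lexicon of French biomedical morphemes (illustrative only).
MORPHEMES = {"cardio", "myo", "pathie", "hépat", "ite"}


def segment(word, morphemes):
    """Greedy longest-match split of `word` into known morphemes.
    Characters not covered by any morpheme are grouped into leftover chunks."""
    ordered = sorted(morphemes, key=len, reverse=True)  # try longest first
    pieces, unknown, i = [], "", 0
    while i < len(word):
        match = next((m for m in ordered if word.startswith(m, i)), None)
        if match:
            if unknown:
                # Flush the pending run of unmatched characters.
                pieces.append(unknown)
                unknown = ""
            pieces.append(match)
            i += len(match)
        else:
            unknown += word[i]
            i += 1
    if unknown:
        pieces.append(unknown)
    return pieces


split_term = segment("cardiomyopathie", MORPHEMES)
plain_word = segment("patient", MORPHEMES)
```

In this sketch a compound term is split into its morphemes, while a word containing no listed morpheme is returned as a single piece. Each piece could then be passed to an existing subword tokenizer such as BPE or SentencePiece.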

DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain

Feb 20, 2024

Yanis Labrak, Adrien Bazoge, Oumaima El Khettari, Mickael Rouvier, Pacome Constant dit Beaufils, Natalia Grabar, Beatrice Daille, Solen Quiniou, Emmanuel Morin, Pierre-Antoine Gourraud (+1 more)

Abstract: The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLM qualities from various perspectives. Although still limited to a few languages, this initiative has been undertaken in the biomedical field, notably English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English-specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.

* Accepted at LREC-Coling 2024

A Large-Scale Dataset for Biomedical Keyphrase Generation

Nov 22, 2022

Mael Houbre, Florian Boudin, Beatrice Daille

Abstract: Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. There are few datasets for keyphrase generation in the biomedical domain and they do not meet the expectations in terms of size for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.

* Accepted at the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI 2022)
