Richard Dufour

LS2N - équipe TALN

Identifying Reliable Evaluation Metrics for Scientific Text Revision

Jun 06, 2025

Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Abstract:Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

* V1 contains only the English version, accepted to ACL 2025 main (26 pages). V2 contains both English (ACL 2025) and French (TALN 2025) versions (58 pages)

A Benchmark of French ASR Systems Based on Error Severity

Jan 18, 2025

Antoine Tholly, Jane Wottawa, Mickael Rouvier, Richard Dufour

Abstract:Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare them with a reference transcription, such as Word Error Rate (WER), which measures spelling deviations from the reference, or semantic score-based metrics. However, these approaches often overlook what is understandable to humans when interpreting transcription errors. To address this limitation, a new evaluation is proposed that categorizes errors into four levels of severity, further divided into subtypes, based on objective linguistic criteria, contextual patterns, and the use of content words as the unit of analysis. This metric is applied to a benchmark of 10 state-of-the-art ASR systems for French, encompassing both HMM-based and end-to-end models. Our findings reveal the strengths and weaknesses of each system, identifying those that provide the most comfortable reading experience for users.
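
For readers unfamiliar with the WER baseline mentioned in this abstract, the following is a minimal sketch of the standard word-level edit-distance computation. The function name `word_error_rate` and the French example sentences are illustrative assumptions and do not come from the paper; the severity-based metric the authors propose is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("dort" -> "dors") and one deletion ("le").
reference = "le chat dort sur le lit"
hypothesis = "le chat dors sur lit"
print(round(word_error_rate(reference, hypothesis), 3))
```

Both errors in the example count equally towards WER, whatever their effect on readability, which is the kind of limitation the abstract describes.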

* To be published in COLING 2025 Proceedings

ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction

Jan 09, 2025

Léane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, Akiko Aizawa

Abstract:Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph-level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, regardless of the model or the metric considered.

* Accepted at the WRAICogs 1 workshop (co-located with COLING 2025)

The Role of Natural Language Processing Tasks in Automatic Literary Character Network Construction

Dec 16, 2024

Arthur Amalvy, Vincent Labatut, Richard Dufour

Abstract:The automatic extraction of character networks from literary texts is generally carried out using natural language processing (NLP) cascading pipelines. While this approach is widespread, no study exists on the impact of low-level NLP tasks on their performance. In this article, we conduct such a study on a literary dataset, focusing on the role of named entity recognition (NER) and coreference resolution when extracting co-occurrence networks. To highlight the impact of these tasks' performance, we start with gold-standard annotations, progressively add uniformly distributed errors, and observe their impact in terms of character network quality. We demonstrate that NER performance depends on the tested novel and strongly affects character detection. We also show that NER-detected mentions alone miss a lot of character co-occurrences, and that coreference resolution is needed to prevent this. Finally, we present comparison points with 2 methods based on large language models (LLMs), including a fully end-to-end one, and show that these models are outperformed by traditional NLP pipelines in terms of recall.
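
To illustrate why coreference resolution matters for co-occurrence networks, here is a minimal sketch of window-based co-occurrence extraction. The function `cooccurrence_network`, the window size, and the character mentions are hypothetical choices for illustration only; they are not taken from the paper or its pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(mentions, window=10):
    """Count character co-occurrences.

    mentions: list of (token_index, character_name) pairs.
    Two mentions of different characters co-occur when they are at most
    `window` tokens apart. Returns a Counter keyed by sorted name pairs.
    """
    edges = Counter()
    for (pos_a, name_a), (pos_b, name_b) in combinations(sorted(mentions), 2):
        if name_a != name_b and pos_b - pos_a <= window:
            edges[tuple(sorted((name_a, name_b)))] += 1
    return edges


# Hypothetical mentions: NER alone finds the two proper names, far apart.
ner_only = [(0, "Elizabeth"), (40, "Darcy")]
# Coreference resolution additionally links a pronoun at token 35 to Elizabeth.
with_coref = ner_only + [(35, "Elizabeth")]

print(cooccurrence_network(ner_only))
print(cooccurrence_network(with_coref))
```

With the proper-name mentions alone, the two characters never fall within the same window, so no edge is created; the resolved pronoun recovers the interaction.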

* 31st International Conference on Computational Linguistics, Jan 2025, Abu Dhabi, United Arab Emirates

Whole-Graph Representation Learning For the Classification of Signed Networks

Sep 30, 2024

Noé Cecillon, Vincent Labatut, Richard Dufour, Nejat Arınık

Abstract:Graphs are ubiquitous for modeling complex systems involving structured data and relationships. Consequently, graph representation learning, which aims to automatically learn low-dimensional representations of graphs, has drawn a lot of attention in recent years. The overwhelming majority of existing methods handle unsigned graphs. However, signed graphs appear in an increasing number of application domains to model systems involving two types of opposed relationships. Several authors took an interest in signed graphs and proposed methods for providing vertex-level representations, but only one exists for whole-graph representations, and it can handle only fully connected graphs. In this article, we tackle this issue by proposing two approaches to learning whole-graph representations of general signed graphs. The first is SG2V, a signed generalization of the whole-graph embedding method Graph2vec that relies on a modification of the Weisfeiler-Lehman relabelling procedure. The second is WSGCN, a whole-graph generalization of the signed vertex embedding method SGCN that relies on the introduction of master nodes into the GCN. We propose several variants of both these approaches. A bottleneck in the development of whole-graph-oriented methods is the lack of data. We constitute a benchmark composed of three collections of signed graphs with corresponding ground truths. We assess our methods on this benchmark, and our results show that the signed whole-graph methods learn better representations for this task. Overall, the baseline obtains an F-measure score of 58.57, while SG2V and WSGCN reach 73.01 and 81.20, respectively. Our source code and benchmark dataset are both publicly available online.
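
As a rough illustration of Weisfeiler-Lehman relabelling on a signed graph, the sketch below folds the edge sign into each neighbour's contribution to a node's signature. This is one plausible way to make the procedure sign-aware and is an assumption made for illustration; it is not claimed to be the modification used in SG2V, and the function `wl_relabel` and the toy triangle are hypothetical.

```python
def wl_relabel(adjacency, labels, iterations=2):
    """Sign-aware Weisfeiler-Lehman relabelling (illustrative variant).

    adjacency: {node: [(neighbour, sign), ...]} with sign in {+1, -1}.
    labels: {node: int} initial node labels.
    Returns the node labels after the given number of iterations.
    """
    labels = dict(labels)
    for _ in range(iterations):
        # A node's signature is its own label plus the sorted multiset of
        # (edge sign, neighbour label) pairs.
        signatures = {
            node: (
                labels[node],
                tuple(sorted((sign, labels[nb]) for nb, sign in neighbours)),
            )
            for node, neighbours in adjacency.items()
        }
        # Compress every distinct signature into a small integer label.
        table = {sig: idx for idx, sig in enumerate(sorted(set(signatures.values())))}
        labels = {node: table[signatures[node]] for node in adjacency}
    return labels


# Toy triangle: a-b and b-c are positive, a-c is negative.
signed_triangle = {
    "a": [("b", +1), ("c", -1)],
    "b": [("a", +1), ("c", +1)],
    "c": [("a", -1), ("b", +1)],
}
# Same triangle with every edge positive, i.e. the unsigned view.
positive_triangle = {
    node: [(nb, +1) for nb, _ in neighbours]
    for node, neighbours in signed_triangle.items()
}
start = {"a": 0, "b": 0, "c": 0}

print(wl_relabel(signed_triangle, start))
print(wl_relabel(positive_triangle, start))
```

In the all-positive triangle every node ends with the same label, whereas the signed version separates `b` (two positive edges) from `a` and `c` (one positive, one negative), so the signs carry structural information that an unsigned relabelling would discard.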

Renard: A Modular Pipeline for Extracting Character Networks from Narrative Texts

Jul 02, 2024

Arthur Amalvy, Vincent Labatut, Richard Dufour

Abstract:Renard (Relationships Extraction from NARrative Documents) is a Python library that allows users to define custom natural language processing (NLP) pipelines to extract character networks from narrative texts. Contrary to the few existing tools, Renard can extract dynamic networks, as well as the more common static networks. Renard pipelines are modular: users can choose the implementation of each NLP subtask needed to extract a character network. This allows users to specialize pipelines to particular types of texts and to study the impact of each subtask on the extracted network.

* Journal of Open Source Software, 9(98), 6574 (2024)
* Accepted at JOSS

Zero-Shot End-To-End Spoken Question Answering In Medical Domain

Jun 09, 2024

Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Abstract:In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach, compared to traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a combined 1.3B-parameter LLM with a 1.55B-parameter ASR model while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.

* Accepted to INTERSPEECH 2024

CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions

Mar 01, 2024

Leane Jourdan, Florian Boudin, Nicolas Hernandez, Richard Dufour

Abstract:Writing a scientific article is a challenging task as it is a highly codified and specific genre; consequently, proficiency in written communication is essential for effectively conveying research findings and ideas. In this article, we propose an original textual resource on the revision step of the writing process of scientific articles. This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at sentence-level while keeping paragraph location information as metadata for supporting future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and associated revision intention. To assess the initial quality of the dataset, we conducted a qualitative study of several state-of-the-art text revision approaches and compared various evaluation metrics. Our experiments led us to question the relevance of the current evaluation methods for the text revision task.

* Accepted at LREC-COLING 2024

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Feb 29, 2024

Quentin Raymondaud, Mickael Rouvier, Richard Dufour

Abstract:Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for acoustic modeling, now integrating deep neural network architectures. However, these performance gains have translated into increased complexity regarding the information learned and conveyed through these black-box architectures. Following the large body of research on neural network interpretability, we propose in this article a protocol that aims to determine which information is located where in an ASR acoustic model (AM). To do so, we propose to evaluate AM performance on a determined set of tasks using intermediate representations (here, at different layer levels). Based on the performance variation across the targeted tasks, we can formulate hypotheses about which information is enhanced or perturbed at different architecture steps. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection and speech sentiment/emotion identification. Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment or speaker identity. The low-level hidden layers overall appear useful for structuring information, while the upper ones tend to delete information that is useless for phoneme recognition.

Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Feb 26, 2024

Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux

Abstract:Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
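
The idea of ranking words by significance and letting the ranking guide masking can be sketched as follows. This is a simplified, deterministic illustration under stated assumptions: the function `selective_mask`, the masking rate, the example sentence, and the significance scores are all hypothetical, and the paper's actual scoring and selection procedure may differ.

```python
import math


def selective_mask(tokens, scores, mask_rate=0.15, mask_token="[MASK]"):
    """Mask the highest-scoring tokens instead of sampling positions at random.

    tokens: list of word strings.
    scores: {lowercased word: significance score}; unknown words score 0.0.
    """
    n_mask = max(1, math.floor(len(tokens) * mask_rate))
    # Rank token positions by the significance of the word they hold.
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: scores.get(tokens[i].lower(), 0.0),
        reverse=True,
    )
    chosen = set(ranked[:n_mask])
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


# Hypothetical legal-domain sentence and made-up significance scores.
tokens = "the court dismissed the appeal filed by the plaintiff".split()
scores = {"plaintiff": 0.9, "appeal": 0.8, "court": 0.6, "dismissed": 0.4}

print(selective_mask(tokens, scores, mask_rate=0.25))
```

Compared with uniform random masking, this concentrates the prediction task on domain-specific words such as "plaintiff" and "appeal" rather than on function words.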
