Mathieu Roche

UMR TETIS, Cirad, Cirad-ES

Guidelines for the Creation of an Annotated Corpus

Jan 19, 2026

Bahdja Boudoua, Nadia Guiffant, Mathieu Roche, Maguelonne Teisseire, Annelise Tran

Abstract: This document, based on feedback from UMR TETIS members and the scientific literature, provides a generic methodology for creating annotation guidelines and annotated textual datasets (corpora). It covers methodological aspects, as well as storage, sharing, and valorization of the data. It includes definitions and examples to clearly illustrate each step of the process, thus providing a comprehensive framework to support the creation and use of corpora in various research contexts.

* 8 pages, 3 figures


Agro-STAY : Collecte de données et analyse des informations en agriculture alternative issues de YouTube

Dec 13, 2024

Laura Maxim, Julien Rabatel, Jean-Marc Douguet, Natalia Grabar, Roberto Interdonato, Sébastien Loustau, Mathieu Roche, Maguelonne Teisseire

Abstract: In response to the current crises (climatic, social, economic), self-sufficiency, a set of practices that combine energy sobriety, self-production of food and energy, and self-construction, is attracting increasing interest. The CNRS STAY project (Savoirs Techniques pour l'Auto-suffisance, sur YouTube) explores this topic by analyzing techniques shared on YouTube. We present Agro-STAY, a platform designed for the collection, processing, and visualization of data from YouTube videos and their comments. We use Natural Language Processing (NLP) techniques and language models, which enable a fine-grained analysis of the alternative agricultural practices described online.

* 8 pages, in French language, 3 figures


A lexicon obtained and validated by a data-driven approach for organic residues valorization in emerging and developing countries

Jun 02, 2024

Christiane Rakotomalala, Jean-Marie Paillat, Frédéric Feder, Angel Avadí, Laurent Thuriès, Marie-Liesse Vermeire, Jean-Michel Médoc, Tom Wassenaar, Caroline Hottelart, Lilou Kieffer (+8 more)

Abstract: The text mining method presented in this paper was used to annotate terms related to the biological transformation and valorization of organic residues in agriculture in low- and middle-income countries. A specialized lexicon was obtained through several steps: corpus creation and term extraction, annotation of the extracted terms, and selection of relevant terms.

* 5 pages, 2 tables


Evaluation of Geographical Distortions in Language Models: A Crucial Step Towards Equitable Representations

Apr 26, 2024

Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin

Abstract: Language models are now essential tools for improving efficiency in many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify their inherent biases. In the field of Natural Language Processing, five sources of bias are well identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, which leads to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions by comparing geographical and semantic distances. Experiments are conducted using these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
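The comparison of geographical and semantic distances mentioned in this abstract can be illustrated with a minimal sketch. It is not one of the paper's four indicators, which the abstract does not define. It assumes that geographical distance is the great-circle (haversine) distance, that semantic distance is the cosine distance between embedding vectors, and that agreement is summarized by a Pearson correlation. The function names and the three-dimensional toy vectors are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Coordinates are approximate; the embedding vectors are made-up toy values,
# not the output of any real language model.
coords = {"Paris": (48.8566, 2.3522), "London": (51.5074, -0.1278), "Hanoi": (21.0278, 105.8342)}
toy_embeddings = {"Paris": (0.9, 0.1, 0.2), "London": (0.8, 0.3, 0.1), "Hanoi": (0.1, 0.9, 0.4)}

pairs = [("Paris", "London"), ("Paris", "Hanoi"), ("London", "Hanoi")]
geo = [haversine_km(coords[a], coords[b]) for a, b in pairs]
sem = [cosine_distance(toy_embeddings[a], toy_embeddings[b]) for a, b in pairs]
print(pearson(geo, sem))
```

Under these assumptions, a correlation close to 1 would mean that semantic distances grow with geographical distances for the chosen places, while a low or negative value would point to a distortion.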


An evaluation framework for comparing epidemic intelligence systems

Mar 30, 2023

Nejat Arinik, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire

Abstract: In the context of Epidemic Intelligence, many Event-Based Surveillance (EBS) systems have been proposed in the literature to promote the early identification and characterization of potential health threats from online sources of any nature. Each EBS system has its own surveillance definitions and priorities, which makes selecting the most appropriate EBS system for a given situation a challenge for end-users. In this work, we propose a new evaluation framework to address this issue. It first transforms the raw input epidemiological event data into a set of normalized events with multi-granularity, then conducts a descriptive retrospective analysis based on four evaluation objectives: spatial, temporal, thematic and source analysis. We illustrate its relevance by applying it to an Avian Influenza dataset collected by a selection of EBS systems, and show how our framework helps identify their strengths and drawbacks in terms of epidemic surveillance.

* IEEE Access, 2023, pp.1 - 1


Annotation of epidemiological information in animal disease-related news articles: guidelines

Jan 15, 2021

Sarah Valentin, Elena Arsevska, Aline Vilain, Valérie De Waele, Renaud Lancelot, Mathieu Roche

Abstract: This paper describes a method for annotation of epidemiological information in animal disease-related news articles. The annotation guidelines are generic and aim to embrace all animal or zoonotic infectious diseases, regardless of the pathogen involved or its way of transmission (e.g. vector-borne, airborne, by contact). The framework relies on the successive annotation of all the sentences from a news article. The annotator evaluates the sentences in a specific epidemiological context, corresponding to the publication of the news article.

* 8 pages


How to define co-occurrence in different domains of study?

Apr 16, 2019

Mathieu Roche

Abstract: This position paper presents a comparative study of co-occurrences. Some similarities and differences in the definition exist depending on the research domain (e.g. linguistics, NLP, computer science). This paper discusses these points, and deals with the methodological aspects in order to identify co-occurrences in a multidisciplinary paradigm.

* CICLING'2018 (International Conference on Computational Linguistics and Intelligent Text Processing) - March 18 to 24, 2018 - Hanoi, Vietnam


Preference Learning in Terminology Extraction: A ROC-based approach

Dec 13, 2005

Jérôme Azé, Mathieu Roche, Yves Kodratoff, Michèle Sebag

Abstract: A key data preparation step in Text Mining, Term Extraction selects the terms, or collocations of words, attached to specific concepts. In this paper, the task of extracting relevant collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as relevant/irrelevant. The candidate terms are described by 13 standard statistical criteria. From these examples, an evolutionary learning algorithm termed Roger, based on the optimization of the Area under the ROC curve criterion, extracts an order on the candidate terms. The robustness of the approach is demonstrated on two real-world applications, considering different domains (biology and human resources) and different languages (English and French).

* Proceedings of Applied Stochastic Models and Data Analysis (2005) 209-219
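The criterion named in this abstract, the area under the ROC curve of a ranking of candidate terms, can be sketched as follows. This is not the Roger algorithm itself: the evolutionary search is omitted, and the linear scoring function, the two criteria per candidate, the weights, and the labels are hypothetical toy values chosen only to show what quantity such a learner would try to maximize.

```python
def auc(scores, labels):
    """Area under the ROC curve of a ranking.

    Computed as the fraction of (relevant, irrelevant) pairs in which the
    relevant candidate gets the higher score; ties count for one half.
    """
    relevant = [s for s, y in zip(scores, labels) if y]
    irrelevant = [s for s, y in zip(scores, labels) if not y]
    if not relevant or not irrelevant:
        raise ValueError("need at least one relevant and one irrelevant candidate")
    correct = sum(
        1.0 if r > i else 0.5 if r == i else 0.0
        for r in relevant
        for i in irrelevant
    )
    return correct / (len(relevant) * len(irrelevant))


def linear_score(weights, criteria):
    """Hypothetical scoring function: weighted sum of a candidate's criteria."""
    return sum(w * c for w, c in zip(weights, criteria))


# Made-up toy data: each candidate term is described by two criteria
# (the abstract mentions 13), with a manual relevant (1) / irrelevant (0) label.
candidates = [(0.9, 0.2), (0.4, 0.8), (0.7, 0.1), (0.2, 0.3)]
labels = [1, 0, 1, 0]
weights = (1.0, -0.5)  # assumed weights; a learner would search for these

scores = [linear_score(weights, c) for c in candidates]
print(auc(scores, labels))
```

A learner optimizing this criterion would adjust the scoring function so that relevant candidates are ranked above irrelevant ones as often as possible.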
