Richard Khoury

Automated Journalistic Questions: A New Method for Extracting 5W1H in French

May 20, 2025

Richard Khoury, Maxence Verhaverbeke, Julie A. Gramaccia

Abstract: The 5W1H questions -- who, what, when, where, why and how -- are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.

* 14 pages, 5 figures, 7 tables

Association Rules Mining with Auto-Encoders

Apr 26, 2023

Théophile Berteloot, Richard Khoury, Audrey Durand

Abstract: Association rule mining is one of the most studied research fields of data mining, with applications ranging from grocery basket problems to explainable classification systems. Classical association rule mining algorithms have several limitations, especially with regard to their high execution times and the number of rules produced. Over the past decade, neural network solutions have been used to solve various optimization problems, such as classification, regression or clustering. However, there is still no efficient way to mine association rules using neural networks. In this paper, we present an auto-encoder solution to mine association rules, called ARM-AE. We compare our algorithm to FP-Growth and NSGAII on three categorical datasets, and show that our algorithm discovers rule sets with high support and confidence and has a better execution time than classical methods while preserving the quality of the rule set produced.

RISC: Generating Realistic Synthetic Bilingual Insurance Contract

Apr 09, 2023

David Beauchemin, Richard Khoury

Abstract: This paper presents RISC, an open-source Python package data generator (https://github.com/GRAAL-Research/risc). RISC generates look-alike automobile insurance contracts based on the Quebec regulatory insurance form in French and English. Insurance contracts are 90 to 100 pages long and use legal and insurance-specific vocabulary that is complex for a layperson. Hence, they are a much more complex class of documents than those in traditional NLP corpora. Therefore, we introduce RISCBAC, a Realistic Insurance Synthetic Bilingual Automobile Contract dataset based on the mandatory Quebec car insurance contract. The dataset comprises 10,000 French and English unannotated insurance contracts. RISCBAC enables NLP research for unsupervised automatic summarisation, question answering, text simplification, machine translation and more. Moreover, it can be further automatically annotated as a dataset for supervised tasks such as NER

* Accepted at Canadian AI conference 2023

Relationship Between Online Harmful Behaviors and Social Network Message Writing Style

Dec 14, 2022

Talia Sanchez Viera, Richard Khoury

Abstract: In this paper, we explore the relationship between an individual's writing style and the risk that they will engage in online harmful behaviors (such as cyberbullying). In particular, we consider whether measurable differences in writing style relate to different personality types, as modeled by the Big-Five personality traits and the Dark Triad traits, and can differentiate between users who do or do not engage in harmful behaviors. We study messages from nearly 2,500 users from two online communities (Twitter and Reddit) and find that we can measure significant personality differences between regular and harmful users from the writing style of as few as 100 tweets or 40 Reddit posts, aggregate these values to distinguish between healthy and harmful communities, and also use style attributes to predict which users will engage in harmful behaviors.

Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy

Dec 10, 2022

Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois

Abstract: Polypharmacy, most often defined as the simultaneous consumption of five or more drugs, is a prevalent phenomenon in the older population. Some of these polypharmacies, deemed inappropriate, may be associated with adverse health outcomes such as death or hospitalization. Considering the combinatorial nature of the problem as well as the size of claims databases and the cost to compute an exact association measure for a given drug combination, it is impossible to investigate every possible combination of drugs. Therefore, we propose to optimize the search for potentially inappropriate polypharmacies (PIPs). To this end, we propose the OptimNeuralTS strategy, based on Neural Thompson Sampling and differential evolution, to efficiently mine claims datasets and build a predictive model of the association between drug combinations and health outcomes. We benchmark our method using two datasets generated by an internally developed simulator of polypharmacy data containing 500 drugs and 100 000 distinct combinations. Empirically, our method can detect up to 33% of PIPs while maintaining an average precision score of 99% using 10 000 time steps.

Cambrian Explosion Algorithm for Multi-Objective Association Rules Mining

Nov 23, 2022

Théophile Berteloot, Richard Khoury, Audrey Durand

Abstract: Association rule mining is one of the most studied research fields of data mining, with applications ranging from grocery basket problems to highly explainable classification systems. Classical association rule mining algorithms have several flaws, especially with regard to their execution times, memory usage and number of rules produced. An alternative is the use of meta-heuristics, which have been used on several optimisation problems. This paper has two objectives. First, we provide a comparison of the performances of state-of-the-art meta-heuristics on the association rule mining problem. We use the multi-objective versions of those algorithms using support, confidence and cosine. Second, we propose a new algorithm designed to mine rules efficiently from massive datasets by exploring a large variety of solutions, akin to the explosion of species diversity of the Cambrian Explosion. We compare our algorithm to 20 benchmark algorithms on 22 real-world datasets, and show that our algorithm presents good results and outperforms several state-of-the-art algorithms.
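As an illustration of the three objectives named in the abstract above (support, confidence and cosine), the following is a minimal sketch using their standard textbook definitions for a rule A -> C. It is not code from the paper; the function name `rule_metrics` and the toy transactions are assumptions made for this example.

```python
from math import sqrt


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, cosine) for the rule antecedent -> consequent.

    Hypothetical helper using the standard definitions:
      support    = P(A and C)
      confidence = P(A and C) / P(A)
      cosine     = P(A and C) / sqrt(P(A) * P(C))
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)

    # Count transactions containing the antecedent, the consequent, and both.
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)

    support = count_ac / n if n else 0.0
    confidence = count_ac / count_a if count_a else 0.0
    # The 1/n factors cancel, so raw counts can be used directly.
    cosine = count_ac / sqrt(count_a * count_c) if count_a and count_c else 0.0
    return support, confidence, cosine


# Toy grocery-basket data (made up for illustration).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

support, confidence, cosine = rule_metrics(transactions, {"bread"}, {"milk"})
print(f"support={support:.3f} confidence={confidence:.3f} cosine={cosine:.3f}")
```

A multi-objective miner would treat these three values as separate objectives to maximise for each candidate rule, rather than combining them into a single score.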

Quantifying French Document Complexity

Aug 27, 2022

Vincent Primpied, David Beauchemin, Richard Khoury

Abstract: Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic or working on a language other than English. In this paper, we define a methodology to measure the complexity of French documents, using a new general and diversified corpus of texts, the "French Canadian complexity level corpus", and a wide range of metrics. We compare different learning algorithms on this task and contrast their performances and their observations on which characteristics of the texts are more significant to their complexity. Our results show that our methodology gives a general-purpose measurement of text complexity in French.

* Accepted in CAIA 2022

A Novel Word Sense Disambiguation Approach Using WordNet Knowledge Graph

Jan 08, 2021

Mohannad AlMousa, Rachid Benlamri, Richard Khoury

Abstract: Various applications in computational linguistics and artificial intelligence rely on high-performing word sense disambiguation techniques to solve challenging tasks such as information retrieval, machine translation, question answering, and document clustering. While text comprehension is intuitive for humans, machines face tremendous challenges in processing and interpreting a human's natural language. This paper presents a novel knowledge-based word sense disambiguation algorithm, namely Sequential Contextual Similarity Matrix Multiplication (SCSMM). The SCSMM algorithm combines semantic similarity, heuristic knowledge, and document context to respectively exploit the merits of local context between consecutive terms, human knowledge about terms, and a document's main topic in disambiguating terms. Unlike other algorithms, the SCSMM algorithm guarantees the capture of the maximum sentence context while maintaining the terms' order within the sentence. The proposed algorithm outperformed all other algorithms when disambiguating nouns on the combined gold standard datasets, while demonstrating comparable results to current state-of-the-art word sense disambiguation systems when dealing with each dataset separately. Furthermore, the paper discusses the impact of granularity level, ambiguity rate, sentence size, and part of speech distribution on the performance of the proposed algorithm.

Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations

Nov 24, 2020

David Beauchemin, Nicolas Garneau, Eve Gaumond, Pierre-Luc Déziel, Richard Khoury, Luc Lamontagne

Abstract: Plumitifs (dockets) were initially a tool for law clerks. Nowadays, they are used as summaries presenting all the steps of a judicial case. Information concerning parties' identity, jurisdiction in charge of administering the case, and some information relating to the nature and the course of the proceedings are available through plumitifs. They are publicly accessible but barely understandable; they are written using abbreviations and refer to provisions from the Criminal Code of Canada, which makes them hard to reason about. In this paper, we propose a simple yet efficient multi-source language generation architecture that leverages both the plumitif and the Criminal Code's content to generate intelligible plumitifs descriptions. It goes without saying that ethical considerations arise with these sensitive documents made readable and available at scale, legitimate concerns that we address in this paper.

* INLG 2020

Exploiting Non-Taxonomic Relations for Measuring Semantic Similarity and Relatedness in WordNet

Jun 22, 2020

Mohannad AlMousa, Rachid Benlamri, Richard Khoury

Abstract: Various applications in the areas of computational linguistics and artificial intelligence employ semantic similarity to solve challenging tasks, such as word sense disambiguation, text classification, information retrieval, machine translation, and document clustering. Previous work on semantic similarity followed a mono-relational approach using mostly the taxonomic relation "ISA". This paper explores the benefits of using all types of non-taxonomic relations in large linked data, such as the WordNet knowledge graph, to enhance existing semantic similarity and relatedness measures. We propose a holistic poly-relational approach based on a new relation-based information content and non-taxonomic-based weighted paths to devise a comprehensive semantic similarity and relatedness measure. To demonstrate the benefits of exploiting non-taxonomic relations in a knowledge graph, we used three strategies to deploy non-taxonomic relations at different granularity levels. We conducted experiments on four well-known gold standard datasets, and the results demonstrated the robustness and scalability of the proposed semantic similarity and relatedness measure, which significantly improves existing similarity measures.
