Martin Lentschat

SIGMA, GETALP

Detection of tortured phrases in scientific literature

Feb 02, 2024

Eléna Martel, Martin Lentschat, Cyril Labbé

Abstract: This paper presents several automatic methods for detecting so-called tortured phrases in scientific papers. These tortured phrases, e.g. "flag to clamor" instead of "signal to noise", result from paraphrasing tools used to evade plagiarism detection. We built a dataset and evaluated several strategies for flagging previously undocumented tortured phrases. The proposed methods rely on language models and use either embedding similarity or masked-token prediction. We found that an approach that uses token prediction and propagates the scores to the chunk level gives the best results. With a recall of 0.87 and a precision of 0.61, it can retrieve new tortured phrases to be submitted to domain experts for validation.

* Proceedings of the 2nd Workshop on Information Extraction from Scientific Publications, Nov 2023, Bali, Indonesia
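
The chunk-level propagation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each token already carries a suspicion score (for instance one minus the probability a masked language model assigns to it) and that a chunk's score is the mean of its token scores. The function names, the aggregation by mean, the threshold, and the toy scores are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


def chunk_scores(token_scores: List[float], chunks: List[Span]) -> Dict[Span, float]:
    """Propagate token scores to chunks by averaging (an assumed aggregation)."""
    return {(start, end): mean(token_scores[start:end]) for start, end in chunks}


def flag_chunks(token_scores: List[float], chunks: List[Span], threshold: float) -> List[Span]:
    """Return the chunks whose aggregated score reaches the threshold."""
    scores = chunk_scores(token_scores, chunks)
    return [span for span, score in scores.items() if score >= threshold]


tokens = ["the", "flag", "to", "clamor", "ratio", "is", "low"]
# Hypothetical per-token suspicion scores; a real system would derive them
# from masked-token predictions of a language model.
token_scores = [0.05, 0.90, 0.10, 0.95, 0.20, 0.05, 0.10]
# Hypothetical chunk boundaries: "the flag to clamor ratio" and "is low".
chunks = [(0, 5), (5, 7)]

flagged = flag_chunks(token_scores, chunks, threshold=0.4)
flagged_text = [" ".join(tokens[start:end]) for start, end in flagged]
```

Scoring whole chunks rather than single tokens means a candidate phrase is handed to a reviewer as a unit, which matches the stated goal of submitting phrases to domain experts.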

NanoNER: Named Entity Recognition for nanobiology using experts' knowledge and distant supervision

Jan 30, 2024

Martin Lentschat, Cyril Labbé, Ran Cheng

Abstract: Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for nanobiology. NER consists of identifying specific entities in spans of unstructured text and is often a primary task in Natural Language Processing (NLP) and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the essential knowledge of the domain. Relying on ontologies, which provide us with a domain vocabulary and taxonomy, we implemented an iterative process enabling experts to determine the entities relevant to the domain at hand. We then examine the potential of distant supervision learning in NER, showing how this method can increase the quantity of annotated data with minimal additional manpower. On our full corpus of 728 full-text nanobiology articles, containing more than 120k entity occurrences, NanoNER obtained an F1-score of 0.98 on the recognition of previously known entities. Our model also demonstrated its ability to discover new entities in the text, with precision scores ranging from 0.77 to 0.81. Ablation experiments further confirmed this and allowed us to assess how much our approach depends on external resources: they highlighted this dependency while also confirming the model's ability to rediscover up to 30% of the ablated terms. This paper details the methodology employed, the experimental design, and the key findings, providing insights and directions for future research on NER in specialized domains. Furthermore, since our approach requires minimal manpower, we believe that it can be generalized to other specialized fields.
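
As a rough illustration of distant supervision for NER, the sketch below labels tokens automatically by matching them against a domain vocabulary, producing BIO tags without manual annotation. It is a generic longest-match dictionary tagger and not the NanoNER pipeline; the function name, the single `ENT` label, and the example vocabulary terms are hypothetical.

```python
from typing import List, Set, Tuple

Term = Tuple[str, ...]  # a vocabulary entry as a tuple of lowercase tokens


def distant_labels(tokens: List[str], vocabulary: Set[Term]) -> List[str]:
    """Assign BIO tags by greedy longest match against the vocabulary."""
    max_len = max((len(term) for term in vocabulary), default=0)
    lowered = [token.lower() for token in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in vocabulary:
                matched = n
                break
        if matched:
            labels[i] = "B-ENT"
            for j in range(i + 1, i + matched):
                labels[j] = "I-ENT"
            i += matched
        else:
            i += 1
    return labels


# Hypothetical vocabulary; in practice it would come from an ontology.
vocabulary = {("gold", "nanoparticles"), ("liposome",)}
tokens = "Gold nanoparticles were loaded into a liposome".split()
labels = distant_labels(tokens, vocabulary)
```

Labels produced this way are noisy, since any term missing from the vocabulary is tagged `O`, which is one reason an approach of this kind depends on the quality of the external resource.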

Investigating the detection of Tortured Phrases in Scientific Literature

Oct 24, 2022

Puthineath Lay, Martin Lentschat, Cyril Labbé

Abstract: With the help of online tools, unscrupulous authors can now generate a pseudo-scientific article and attempt to publish it. Some of these tools work by replacing or paraphrasing existing text to produce new content, but they tend to generate nonsensical expressions. A recent study introduced the concept of the "tortured phrase", an unexpected, odd phrase that appears in place of a fixed expression, e.g. "counterfeit consciousness" instead of "artificial intelligence". The present study investigates how tortured phrases that are not yet listed can be detected automatically. We conducted several experiments, including non-neural binary classification, neural binary classification, and cosine similarity comparison of the phrase tokens, yielding noticeable results.
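
The cosine similarity comparison can be sketched as follows, assuming that the tokens of a candidate phrase are compared position by position with the tokens of the expected expression. The three-dimensional vectors are made-up toy embeddings and the helper names are hypothetical; a real experiment would take vectors from a trained embedding model.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def phrase_similarities(candidate: List[str], expected: List[str],
                        embeddings: Dict[str, Vector]) -> List[float]:
    """Compare the two phrases token by token, in order."""
    return [cosine(embeddings[c], embeddings[e]) for c, e in zip(candidate, expected)]


# Toy embeddings, invented for this example only.
embeddings = {
    "counterfeit": [0.9, 0.1, 0.2],
    "artificial": [0.8, 0.2, 0.1],
    "consciousness": [0.1, 0.9, 0.3],
    "intelligence": [0.2, 0.8, 0.2],
    "banana": [0.1, 0.1, 0.9],
}

similar = phrase_similarities(["counterfeit", "consciousness"],
                              ["artificial", "intelligence"], embeddings)
unrelated = cosine(embeddings["banana"], embeddings["artificial"])
```

With these toy vectors, each tortured token stays close to the token it replaces while an unrelated word does not, which is the kind of signal such a comparison looks for.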
