Eduard Dragut

DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

Apr 06, 2025

Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, Eduard Dragut

Abstract: Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most existing work attempts to solve this problem by developing intricate models that learn from the noisy labels. An alternative approach is to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.

* Accepted to NAACL2025-Findings
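
The idea of characterizing distantly annotated samples by model behavior during training can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the training-dynamics metric or the threshold estimation strategy, so the mean per-epoch confidence score, the "mean minus one standard deviation" cutoff, and the names `mean_confidence` and `clean_labels` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_confidence(prob_history):
    """Average probability given to the distant label across epochs."""
    return mean(prob_history)


def clean_labels(samples, threshold=None):
    """Split samples into kept and removed ids by mean confidence.

    samples: list of (sample_id, [p_epoch1, p_epoch2, ...]) pairs, where
    each p is the model's probability for the distantly assigned label.
    threshold: if None, fall back to mean - 1 std of the scores. This
    fallback is an illustrative heuristic, NOT the paper's strategy.
    """
    scores = {sid: mean_confidence(history) for sid, history in samples}
    if threshold is None:
        values = list(scores.values())
        threshold = mean(values) - pstdev(values)
    kept = [sid for sid, score in scores.items() if score >= threshold]
    removed = [sid for sid, score in scores.items() if score < threshold]
    return kept, removed, threshold


# Toy data: span_c is never learned confidently, so it is flagged.
toy = [
    ("span_a", [0.6, 0.8, 0.9]),
    ("span_b", [0.5, 0.7, 0.9]),
    ("span_c", [0.2, 0.1, 0.1]),
    ("span_d", [0.7, 0.8, 0.9]),
]
kept, removed, threshold = clean_labels(toy)
print("removed:", removed)
```

In this toy run only `span_c` falls below the cutoff; a real pipeline would take the probabilities from an NER model logged at each epoch.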

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

Oct 28, 2024

Qi Zhang, Zhijia Chen, Huitong Pan, Cornelia Caragea, Longin Jan Latecki, Eduard Dragut

Abstract: Scientific information extraction (SciIE) is critical for converting unstructured knowledge from scholarly articles into structured data (entities and relations). Several datasets have been proposed for training and validating SciIE models. However, due to the high complexity and cost of annotating scientific texts, those datasets restrict their annotations to specific parts of a paper, such as abstracts, resulting in the loss of diverse entity mentions and relations in context. In this paper, we release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations. To capture the intricate use and interactions among entities in full texts, our dataset contains a fine-grained tag set for relations. Additionally, we provide an out-of-distribution test set to offer a more realistic evaluation. We conduct comprehensive experiments, including state-of-the-art supervised models and our proposed LLM-based baselines, and highlight the challenges presented by our dataset, encouraging the development of innovative models to further the field of SciIE.

* EMNLP2024 Main

FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

Jul 09, 2024

Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki

Abstract: Flowcharts are graphical tools for representing complex concepts in concise visual representations. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature and the simulated subset contains 10,000 flowcharts created using a customizable script. The dataset is enriched with annotations for visual components, OCR, Mermaid code representation, and VQA question-answer pairs. Despite the proven capabilities of Large Vision-Language Models (LVLMs) in various visual understanding tasks, their effectiveness in decoding flowcharts, a crucial element of scientific communication, has yet to be thoroughly investigated. The FlowLearn test set is crafted to assess the performance of LVLMs in flowchart comprehension. Our study thoroughly evaluates state-of-the-art LVLMs, identifying existing limitations and establishing a foundation for future enhancements in this relatively underexplored domain. For instance, in tasks involving simulated flowcharts, GPT-4V achieved the highest accuracy (58%) in counting the number of nodes, while Claude recorded the highest accuracy (83%) in OCR tasks. Notably, no single model excels in all tasks within the FlowLearn framework, highlighting significant opportunities for further development.

* ECAI 2024

SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Jun 20, 2024

Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki

Abstract: We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus's scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus's utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.

* LREC-COLING 2024, pp. 14407-14417

DMDD: A Large-Scale Dataset for Dataset Mentions Detection

May 19, 2023

Huitong Pan, Qi Zhang, Eduard Dragut, Cornelia Caragea, Longin Jan Latecki

Abstract: The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.

* Pre-MIT Press publication version. Submitted to TACL

Stay on Topic, Please: Aligning User Comments to the Content of a News Article

Mar 03, 2021

Jumanah Alshehri, Marija Stanojevic, Eduard Dragut, Zoran Obradovic

Abstract: Social scientists have shown that up to 50% of the content posted to a news article has no relation to its journalistic content. In this study, we propose a classification algorithm to categorize user comments posted to a news article based on their alignment to its content. The alignment seeks to match user comments to an article based on similarity of content, entities in discussion, and topic. We propose BERTAC, a BERT-based approach that jointly learns article-comment embeddings and infers the relevance class of comments. We introduce an ordinal classification loss that penalizes the difference between the predicted and true label. We conduct a thorough study to show the influence of the proposed loss on the learning process. The results on five representative news outlets show that our approach can learn the comment class with up to 36% average accuracy improvement compared to the baselines, and up to 25% compared to the BA-BC model. BA-BC is our approach that consists of two models aimed at capturing disjointly the formal language of news articles and the informal language of comments. We also conduct a user study to evaluate human labeling performance to understand the difficulty of the classification task. The user agreement on comment-article alignment is "moderate" per Krippendorff's alpha score, which suggests that the classification task is difficult.

* Accepted as a full paper at the 43rd European Conference on Information Retrieval
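
A loss that penalizes the difference between predicted and true label can be sketched as an expected absolute distance over ordered classes. This is only one plausible reading of the abstract, not the paper's formulation: the softmax-weighted distance, the function name `ordinal_distance_loss`, and the four-class toy setup are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_distance_loss(logits, true_label):
    """Expected |k - true_label| under the predicted class distribution.

    Classes are assumed to be ordered integers 0..K-1, so probability
    mass placed far from the true class costs more than mass nearby.
    """
    probs = softmax(logits)
    return sum(p * abs(k - true_label) for k, p in enumerate(probs))


# True class is 0 in all three cases; only the predicted peak moves.
exact = ordinal_distance_loss([5.0, 0.0, 0.0, 0.0], true_label=0)
near_miss = ordinal_distance_loss([0.0, 5.0, 0.0, 0.0], true_label=0)
far_miss = ordinal_distance_loss([0.0, 0.0, 0.0, 5.0], true_label=0)
print(exact, near_miss, far_miss)
```

Unlike plain cross-entropy, which treats every wrong class alike, this sketch charges a prediction three classes away more than one a single class away.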

Birds of a Feather Flock Together: Satirical News Detection via Language Model Differentiation

Jul 04, 2020

Yigeng Zhang, Fan Yang, Yifan Zhang, Eduard Dragut, Arjun Mukherjee

Abstract: Satirical news is regularly shared in modern social media because it is entertaining with smartly embedded humor. However, it can be harmful to society because it can sometimes be mistaken for factual news, due to its deceptive character. We found that in satirical news, the lexical and pragmatic attributes of the context are the key factors in amusing the readers. In this work, we propose a method that differentiates satirical news from true news. It takes advantage of satirical writing evidence by leveraging the difference between the prediction loss of two language models, one trained on true news and the other on satirical news, when given a new news article. We compute several statistical metrics of language model prediction loss as features, which are then used to conduct downstream classification. The proposed method is computationally effective because the language models capture the language usage differences between satirical news documents and traditional news documents, and are sensitive when applied to documents outside their domains.

* 10 pages
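
The feature extraction step described in the abstract, statistics over the loss difference of two language models, can be sketched as follows. The abstract does not list which statistics are used, so the choice of mean, standard deviation, minimum, and maximum is an assumption, as are the function name `loss_difference_features` and the premise that both models report per-token losses over the same tokenization.

```python
from statistics import mean, pstdev


def loss_difference_features(true_news_losses, satire_losses):
    """Summarize per-token loss differences between two language models.

    true_news_losses: per-token losses from a model trained on true news.
    satire_losses: per-token losses from a model trained on satire.
    Both lists are assumed to be aligned token by token.
    """
    if len(true_news_losses) != len(satire_losses):
        raise ValueError("loss sequences must have the same length")
    diffs = [t - s for t, s in zip(true_news_losses, satire_losses)]
    return {
        "mean": mean(diffs),
        "std": pstdev(diffs),
        "min": min(diffs),
        "max": max(diffs),
    }


# Toy losses: the satire model is consistently less surprised,
# which hints that the article reads like satire.
features = loss_difference_features([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
print(features)
```

The resulting dictionary would serve as the input vector for whatever downstream classifier is chosen.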

Stance Prediction for Contemporary Issues: Data and Experiments

May 29, 2020

Marjan Hosseinia, Eduard Dragut, Arjun Mukherjee

Abstract: We investigate whether pre-trained bidirectional transformers with sentiment and emotion information improve stance detection in long discussions of contemporary issues. As a part of this work, we create a novel stance detection dataset covering 419 different controversial issues and their related pros and cons collected by procon.org in a nonpartisan format. Experimental results show that a shallow recurrent neural network with sentiment or emotion information can reach results competitive with fine-tuned BERT while using 20x fewer parameters. We also use a simple approach that explains which input phrases contribute to stance detection.

Satirical News Detection and Analysis using Attention Mechanism and Linguistic Features

Sep 04, 2017

Fan Yang, Arjun Mukherjee, Eduard Dragut

Abstract: Satirical news is considered to be entertainment, but it is potentially deceptive and harmful. Despite the embedded genre in the article, not everyone can recognize the satirical cues, and some readers therefore take the news for true news. We observe that satirical cues are often reflected in certain paragraphs rather than the whole document. Existing works only consider document-level features to detect the satire, which could be limited. We consider paragraph-level linguistic features to unveil the satire by incorporating a neural network and an attention mechanism. We investigate the difference between paragraph-level features and document-level features, and analyze them on a large satirical news dataset. The evaluation shows that the proposed model detects satirical news effectively and reveals what features are important at which level.

* EMNLP 2017, 11 pages
