Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michael Schlichtkrull

Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts

May 24, 2025

Eric Chamoun, Nedjma Ousidhoum, Michael Schlichtkrull, Andreas Vlachos

Abstract:Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset-achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.

Via

Access Paper or Ask Questions

AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

May 23, 2025

Rui Cao, Zifeng Ding, Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos

Abstract:Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and $74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.

Via

Access Paper or Ask Questions

Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs

Nov 20, 2024

Rui Cao, Yuming Jiang, Michael Schlichtkrull, Andreas Vlachos

Abstract:Multimodal Large Language Models (MLLMs) can enhance trustworthiness by aligning with human preferences. As human preference labeling is laborious, recent works employ evaluation models for assessing MLLMs' responses, using the model-based assessments to automate preference dataset construction. This approach, however, faces challenges with MLLMs' lengthy and compositional responses, which often require diverse reasoning skills that a single evaluation model may not fully possess. Additionally, most existing methods rely on closed-source models as evaluators. To address limitations, we propose DecompGen, a decomposable framework that uses an ensemble of open-sourced expert models. DecompGen breaks down each response into atomic verification tasks, assigning each task to an appropriate expert model to generate fine-grained assessments. The DecompGen feedback is used to automatically construct our preference dataset, DGPref. MLLMs aligned with DGPref via preference learning show improvements in trustworthiness, demonstrating the effectiveness of DecompGen.

Via

Access Paper or Ask Questions

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Nov 08, 2024

Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

Abstract:Current automated fact-checking (AFC) approaches commonly evaluate evidence either implicitly via the predicted verdicts or by comparing retrieved evidence with a predefined closed knowledge source, such as Wikipedia. However, these methods suffer from limitations, resulting from their reliance on evaluation metrics developed for different purposes and constraints imposed by closed knowledge sources. Recent advances in natural language generation (NLG) evaluation offer new possibilities for evidence assessment. In this work, we introduce Ev2R, an evaluation framework for AFC that comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less. We evaluate their effectiveness through agreement with human ratings and adversarial tests, and demonstrate that prompt-based scorers, particularly those leveraging LLMs and reference evidence, outperform traditional evaluation approaches.

* 10 pages

Via

Access Paper or Ask Questions

The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Oct 31, 2024

Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Zhijiang Guo, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal(+2 more)

Figure 1 for The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Figure 2 for The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Figure 3 for The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Figure 4 for The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Abstract:The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.

Via

Access Paper or Ask Questions

Generating Media Background Checks for Automated Source Critical Reasoning

Sep 01, 2024

Michael Schlichtkrull

Figure 1 for Generating Media Background Checks for Automated Source Critical Reasoning

Figure 2 for Generating Media Background Checks for Automated Source Critical Reasoning

Figure 3 for Generating Media Background Checks for Automated Source Critical Reasoning

Figure 4 for Generating Media Background Checks for Automated Source Critical Reasoning

Abstract:Not everything on the internet is true. This unfortunate fact requires both humans and models to perform complex reasoning about credibility when working with retrieved information. In NLP, this problem has seen little attention. Indeed, retrieval-augmented models are not typically expected to distrust retrieved documents. Human experts overcome the challenge by gathering signals about the context, reliability, and tendency of source documents - that is, they perform source criticism. We propose a novel NLP task focused on finding and summarising such signals. We introduce a new dataset of 6,709 "media background checks" derived from Media Bias / Fact Check, a volunteer-run website documenting media bias. We test open-source and closed-source LLM baselines with and without retrieval on this dataset, finding that retrieval greatly improves performance. We furthermore carry out human evaluation, demonstrating that 1) media background checks are helpful for humans, and 2) media background checks are helpful for retrieval-augmented models.

Via

Access Paper or Ask Questions

AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web

May 24, 2023

Michael Schlichtkrull, Zhijiang Guo, Andreas Vlachos

Abstract:Existing datasets for automated fact-checking have substantial limitations, such as relying on artificial claims, lacking annotations for evidence and intermediate reasoning, or including evidence published after the claim. In this paper we introduce AVeriTeC, a new dataset of 4,568 real-world claims covering fact-checks by 50 different organizations. Each claim is annotated with question-answer pairs supported by evidence available online, as well as textual justifications explaining how the evidence combines to produce a verdict. Through a multi-round annotation process, we avoid common pitfalls including context dependence, evidence insufficiency, and temporal leakage, and reach a substantial inter-annotator agreement of $\kappa=0.619$ on verdicts. We develop a baseline as well as an evaluation scheme for verifying claims through several question-answering steps against the open web.

Via

Access Paper or Ask Questions

The Intended Uses of Automated Fact-Checking Artefacts: Why, How and Who

Apr 27, 2023

Michael Schlichtkrull, Nedjma Ousidhoum, Andreas Vlachos

Abstract:Automated fact-checking is often presented as an epistemic tool that fact-checkers, social media consumers, and other stakeholders can use to fight misinformation. Nevertheless, few papers thoroughly discuss how. We document this by analysing 100 highly-cited papers, and annotating epistemic elements related to intended use, i.e., means, ends, and stakeholders. We find that narratives leaving out some of these aspects are common, that many papers propose inconsistent means and ends, and that the feasibility of suggested strategies rarely has empirical backing. We argue that this vagueness actively hinders the technology from reaching its goals, as it encourages overclaiming, limits criticism, and prevents stakeholder feedback. Accordingly, we provide several recommendations for thinking and writing about the use of fact-checking artefacts.

Via

Access Paper or Ask Questions

A Survey on Automated Fact-Checking

Aug 26, 2021

Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos

Figure 1 for A Survey on Automated Fact-Checking

Figure 2 for A Survey on Automated Fact-Checking

Figure 3 for A Survey on Automated Fact-Checking

Figure 4 for A Survey on Automated Fact-Checking

Abstract:Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.

* 27 pages, 15 pages of references

Via

Access Paper or Ask Questions

FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Jun 10, 2021

Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal

Figure 1 for FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Figure 2 for FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Figure 3 for FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Figure 4 for FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Abstract:Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases present in the dataset and could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims.

Via

Access Paper or Ask Questions