Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mark Sanderson

LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations

Apr 27, 2025

Laura Dietz, Oleg Zendel, Peter Bailey, Charles Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell

Abstract:Large Language Models (LLMs) are increasingly used to evaluate information retrieval (IR) systems, generating relevance judgments traditionally made by human assessors. Recent empirical studies suggest that LLM-based evaluations often align with human judgments, leading some to suggest that human judges may no longer be necessary, while others highlight concerns about judgment reliability, validity, and long-term impact. As IR systems begin incorporating LLM-generated signals, evaluation outcomes risk becoming self-reinforcing, potentially leading to misleading conclusions. This paper examines scenarios where LLM-evaluators may falsely indicate success, particularly when LLM-based judgments influence both system development and evaluation. We highlight key risks, including bias reinforcement, reproducibility challenges, and inconsistencies in assessment methodologies. To address these concerns, we propose tests to quantify adverse effects, guardrails, and a collaborative framework for constructing reusable test collections that integrate LLM judgments responsibly. By providing perspectives from academia and industry, this work aims to establish best practices for the principled use of LLMs in IR evaluation.

Via

Access Paper or Ask Questions

Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Jan 29, 2025

Marwah Alaofi, Luke Gallagher, Mark Sanderson, Falk Scholer, Paul Thomas

Figure 1 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 2 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 3 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 4 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Abstract:This paper explores the utility of a Large Language Model (LLM) to automatically generate queries and query variants from a description of an information need. Given a set of information needs described as backstories, we explore how similar the queries generated by the LLM are to those generated by humans. We quantify the similarity using different metrics and examine how the use of each set would contribute to document pooling when building test collections. Our results show potential in using LLMs to generate query variants. While they may not fully capture the wide variety of human-generated variants, they generate similar sets of relevant documents, reaching up to 71.1% overlap at a pool depth of 100.

* Published in the proceedings of SIGIR'23

Via

Access Paper or Ask Questions

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Jan 29, 2025

Marwah Alaofi, Paul Thomas, Falk Scholer, Mark Sanderson

Figure 1 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 2 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 3 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 4 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Abstract:LLMs are increasingly being used to assess the relevance of information objects. This work reports on experiments to study the labelling of short texts (i.e., passages) for relevance, using multiple open-source and proprietary LLMs. While the overall agreement of some LLMs with human judgements is comparable to human-to-human agreement measured in previous research, LLMs are more likely to label passages as relevant compared to human judges, indicating that LLM labels denoting non-relevance are more reliable than those indicating relevance. This observation prompts us to further examine cases where human judges and LLMs disagree, particularly when the human judge labels the passage as non-relevant and the LLM labels it as relevant. Results show a tendency for many LLMs to label passages that include the original query terms as relevant. We, therefore, conduct experiments to inject query words into random and irrelevant passages, not unlike the way we inserted the query "best caf\'e near me" into this paper. The results show that LLMs are highly influenced by the presence of query words in the passages under assessment, even if the wider passage has no relevance to the query. This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures. There is a real risk of bias in LLM-generated relevance labels and, therefore, a risk of bias in rankers trained on those labels. We also investigate the effects of deliberately manipulating LLMs by instructing them to label passages as relevant, similar to the instruction "this paper is perfectly relevant" inserted above. We find that such manipulation influences the performance of some LLMs, highlighting the critical need to consider potential vulnerabilities when deploying LLMs in real-world applications.

* Published in the proceedings of SIGIR-AP'24

Via

Access Paper or Ask Questions

Metamorphic Evaluation of ChatGPT as a Recommender System

Nov 18, 2024

Madhurima Khirbat, Yongli Ren, Pablo Castells, Mark Sanderson

Abstract:With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been working on how to utilize the LLMs for better recommendations. However, although LLMs exhibit black-box and probabilistic characteristics (meaning their internal working is not visible), the evaluation framework used for assessing these LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce the metamorphic testing for the evaluation of GPT-based RS. This testing technique involves defining of metamorphic relations (MRs) between the inputs and checking if the relationship has been satisfied in the outputs. Specifically, we examined the MRs from both RS and LLMs perspectives, including rating multiplication/shifting in RS and adding spaces/randomness in the LLMs prompt via prompt perturbation. Similarity metrics (e.g. Kendall tau and Ranking Biased Overlap(RBO)) are deployed to measure whether the relationship has been satisfied in the outputs of MRs. The experiment results on MovieLens dataset with GPT3.5 show that lower similarity are obtained in terms of Kendall $\tau$ and RBO, which concludes that there is a need of a comprehensive evaluation of the LLM-based RS in addition to the existing evaluation metrics used for traditional recommender systems.

Via

Access Paper or Ask Questions

Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Oct 20, 2024

Jiayang Niu, Jie Li, Ke Deng, Mark Sanderson, Yongli Ren

Abstract:We propose Counterfactual Analysis Quadratic Unconstrained Binary Optimization (CAQUBO) to solve QUBO problems for feature selection in recommender systems. CAQUBO leverages counterfactual analysis to measure the impact of individual features and feature combinations on model performance and employs the measurements to construct the coefficient matrix for a quantum annealer to select the optimal feature combinations for recommender systems, thereby improving their final recommendation performance. By establishing explicit connections between features and the recommendation performance, the proposed approach demonstrates superior performance compared to the state-of-the-art quantum annealing methods. Extensive experiments indicate that integrating quantum computing with counterfactual analysis holds great promise for addressing these challenges.

Via

Access Paper or Ask Questions

Perfect Counterfactuals in Imperfect Worlds: Modelling Noisy Implementation of Actions in Sequential Algorithmic Recourse

Oct 03, 2024

Yueqing Xuan, Kacper Sokol, Mark Sanderson, Jeffrey Chan

Abstract:Algorithmic recourse provides actions to individuals who have been adversely affected by automated decision-making and helps them achieve a desired outcome. Knowing the recourse, however, does not guarantee that users would implement it perfectly, either due to environmental variability or personal choices. Recourse generation should thus anticipate its sub-optimal or noisy implementation. While several approaches have constructed recourse that accounts for robustness to small perturbation (i.e., noisy recourse implementation), they assume an entire recourse to be implemented in a single step and thus apply one-off uniform noise to it. Such assumption is unrealistic since recourse often includes multiple sequential steps which becomes harder to implement and subject to more noise. In this work, we consider recourse under plausible noise that adapts to the local data geometry and accumulates at every step of the way. We frame this problem as a Markov Decision Process and demonstrate that the distribution of our plausible noise satisfies the Markov property. We then propose the RObust SEquential (ROSE) recourse generator to output a sequence of steps that will lead to the desired outcome even under imperfect implementation. Given our plausible modelling of sub-optimal human actions and greater recourse robustness to accumulated uncertainty, ROSE can grant users higher chances of success under low recourse costs. Empirical evaluation shows our algorithm manages the inherent trade-off between recourse robustness and costs more effectively while ensuring its low sparsity and fast computation.

Via

Access Paper or Ask Questions

Generative Information Retrieval Evaluation

Apr 11, 2024

Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson

Abstract:In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating that LLMs may be superior to crowdsource workers and other paid assessors on basic relevance judgement tasks. We review past and ongoing related research, including speculation on the future of shared task initiatives, such as TREC, and a discussion on the continuing need for human assessments. Second, we consider the evaluation of emerging LLM-based generative information retrieval (GenIR) systems, including retrieval augmented generation (RAG) systems. We consider approaches that focus both on the end-to-end evaluation of GenIR systems and on the evaluation of a retrieval component as an element in a RAG system. Going forward, we expect the evaluation of GenIR systems to be at least partially based on LLM-based assessment, creating an apparent circularity, with a system seemingly evaluating its own output. We resolve this apparent circularity in two ways: 1) by viewing LLM-based assessment as a form of "slow search", where a slower IR system is used for evaluation and training of a faster production IR system; and 2) by recognizing a continuing need to ground evaluation in human assessment, even if the characteristics of that human assessment must change.

* Draft of a chapter intended to appear in a forthcoming book on generative information retrieval, co-edited by Chirag Shah and Ryen White

Via

Access Paper or Ask Questions

Online and Offline Evaluation in Search Clarification

Mar 14, 2024

Leila Tavakoli, Johanne R. Trippas, Hamed Zamani, Falk Scholer, Mark Sanderson

Abstract:The effectiveness of clarification question models in engaging users within search systems is currently constrained, casting doubt on their overall usefulness. To improve the performance of these models, it is crucial to employ assessment approaches that encompass both real-time feedback from users (online evaluation) and the characteristics of clarification questions evaluated through human assessment (offline evaluation). However, the relationship between online and offline evaluations has been debated in information retrieval. This study aims to investigate how this discordance holds in search clarification. We use user engagement as ground truth and employ several offline labels to investigate to what extent the offline ranked lists of clarification resemble the ideal ranked lists based on online user engagement.

* 27 pages

Via

Access Paper or Ask Questions

i-Align: an interpretable knowledge graph alignment model

Aug 26, 2023

Bayu Distiawan Trisedya, Flora D Salim, Jeffrey Chan, Damiano Spina, Falk Scholer, Mark Sanderson

Abstract:Knowledge graphs (KGs) are becoming essential resources for many downstream applications. However, their incompleteness may limit their potential. Thus, continuous curation is needed to mitigate this problem. One of the strategies to address this problem is KG alignment, i.e., forming a more complete KG by merging two or more KGs. This paper proposes i-Align, an interpretable KG alignment model. Unlike the existing KG alignment models, i-Align provides an explanation for each alignment prediction while maintaining high alignment performance. Experts can use the explanation to check the correctness of the alignment prediction. Thus, the high quality of a KG can be maintained during the curation process (e.g., the merging process of two KGs). To this end, a novel Transformer-based Graph Encoder (Trans-GE) is proposed as a key component of i-Align for aggregating information from entities' neighbors (structures). Trans-GE uses Edge-gated Attention that combines the adjacency matrix and the self-attention matrix to learn a gating mechanism to control the information aggregation from the neighboring entities. It also uses historical embeddings, allowing Trans-GE to be trained over mini-batches, or smaller sub-graphs, to address the scalability issue when encoding a large KG. Another component of i-Align is a Transformer encoder for aggregating entities' attributes. This way, i-Align can generate explanations in the form of a set of the most influential attributes/neighbors based on attention weights. Extensive experiments are conducted to show the power of i-Align. The experiments include several aspects, such as the model's effectiveness for aligning KGs, the quality of the generated explanations, and its practicality for aligning large KGs. The results show the effectiveness of i-Align in these aspects.

* Data Min Knowl Disc (2023)

Via

Access Paper or Ask Questions

Designing and Evaluating Presentation Strategies for Fact-Checked Content

Aug 20, 2023

Danula Hettiachchi, Kaixin Ji, Jenny Kennedy, Anthony McCosker, Flora Dylis Salim, Mark Sanderson, Falk Scholer, Damiano Spina

Figure 1 for Designing and Evaluating Presentation Strategies for Fact-Checked Content

Figure 2 for Designing and Evaluating Presentation Strategies for Fact-Checked Content

Figure 3 for Designing and Evaluating Presentation Strategies for Fact-Checked Content

Figure 4 for Designing and Evaluating Presentation Strategies for Fact-Checked Content

Abstract:With the rapid growth of online misinformation, it is crucial to have reliable fact-checking methods. Recent research on finding check-worthy claims and automated fact-checking have made significant advancements. However, limited guidance exists regarding the presentation of fact-checked content to effectively convey verified information to users. We address this research gap by exploring the critical design elements in fact-checking reports and investigating whether credibility and presentation-based design improvements can enhance users' ability to interpret the report accurately. We co-developed potential content presentation strategies through a workshop involving fact-checking professionals, communication experts, and researchers. The workshop examined the significance and utility of elements such as veracity indicators and explored the feasibility of incorporating interactive components for enhanced information disclosure. Building on the workshop outcomes, we conducted an online experiment involving 76 crowd workers to assess the efficacy of different design strategies. The results indicate that proposed strategies significantly improve users' ability to accurately interpret the verdict of fact-checking articles. Our findings underscore the critical role of effective presentation of fact reports in addressing the spread of misinformation. By adopting appropriate design enhancements, the effectiveness of fact-checking reports can be maximized, enabling users to make informed judgments.

* Accepted to the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23)

Via

Access Paper or Ask Questions