Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Paul Thomas

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Feb 19, 2025

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz

Abstract:Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/

* 11 pages

Via

Access Paper or Ask Questions

Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Jan 29, 2025

Marwah Alaofi, Luke Gallagher, Mark Sanderson, Falk Scholer, Paul Thomas

Figure 1 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 2 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 3 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Figure 4 for Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study

Abstract:This paper explores the utility of a Large Language Model (LLM) to automatically generate queries and query variants from a description of an information need. Given a set of information needs described as backstories, we explore how similar the queries generated by the LLM are to those generated by humans. We quantify the similarity using different metrics and examine how the use of each set would contribute to document pooling when building test collections. Our results show potential in using LLMs to generate query variants. While they may not fully capture the wide variety of human-generated variants, they generate similar sets of relevant documents, reaching up to 71.1% overlap at a pool depth of 100.

* Published in the proceedings of SIGIR'23

Via

Access Paper or Ask Questions

LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Jan 29, 2025

Marwah Alaofi, Paul Thomas, Falk Scholer, Mark Sanderson

Figure 1 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 2 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 3 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Figure 4 for LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant)

Abstract:LLMs are increasingly being used to assess the relevance of information objects. This work reports on experiments to study the labelling of short texts (i.e., passages) for relevance, using multiple open-source and proprietary LLMs. While the overall agreement of some LLMs with human judgements is comparable to human-to-human agreement measured in previous research, LLMs are more likely to label passages as relevant compared to human judges, indicating that LLM labels denoting non-relevance are more reliable than those indicating relevance. This observation prompts us to further examine cases where human judges and LLMs disagree, particularly when the human judge labels the passage as non-relevant and the LLM labels it as relevant. Results show a tendency for many LLMs to label passages that include the original query terms as relevant. We, therefore, conduct experiments to inject query words into random and irrelevant passages, not unlike the way we inserted the query "best caf\'e near me" into this paper. The results show that LLMs are highly influenced by the presence of query words in the passages under assessment, even if the wider passage has no relevance to the query. This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures. There is a real risk of bias in LLM-generated relevance labels and, therefore, a risk of bias in rankers trained on those labels. We also investigate the effects of deliberately manipulating LLMs by instructing them to label passages as relevant, similar to the instruction "this paper is perfectly relevant" inserted above. We find that such manipulation influences the performance of some LLMs, highlighting the critical need to consider potential vulnerabilities when deploying LLMs in real-world applications.

* Published in the proceedings of SIGIR-AP'24

Via

Access Paper or Ask Questions

SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Aug 30, 2024

Hossein A. Rahmani, Xi Wang, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas

Figure 1 for SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Figure 2 for SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Figure 3 for SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Figure 4 for SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Abstract:Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and the research into publicly available datasets, the existing information retrieval research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments - a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the missing large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional language model synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from the previous years of tracks. We compare system evaluation with past human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.

* 9 pages, resource paper

Via

Access Paper or Ask Questions

LLMJudge: LLMs for Relevance Judgments

Aug 09, 2024

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas, Charles L. A. Clarke, Mohammad Aliannejadi, Clemencia Siro, Guglielmo Faggioli

Figure 1 for LLMJudge: LLMs for Relevance Judgments

Figure 2 for LLMJudge: LLMs for Relevance Judgments

Abstract:The LLMJudge challenge is organized as part of the LLM4Eval workshop at SIGIR 2024. Test collections are essential for evaluating information retrieval (IR) systems. The evaluation and tuning of a search system is largely based on relevance labels, which indicate whether a document is useful for a specific search and user. However, collecting relevance judgments on a large scale is costly and resource-intensive. Consequently, typical experiments rely on third-party labelers who may not always produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate reliable relevance judgments for search systems. However, it remains unclear which LLMs can match the accuracy of human labelers, which prompts are most effective, how fine-tuned open-source LLMs compare to closed-source LLMs like GPT-4, whether there are biases in synthetically generated data, and if data leakage affects the quality of generated labels. This challenge will investigate these questions, and the collected data will be released as a package to support automatic relevance judgment research in information retrieval and search.

* LLMJudge Challenge Overview, 3 pages

Via

Access Paper or Ask Questions

Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024

Aug 09, 2024

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz

Figure 1 for Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024

Figure 2 for Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024

Abstract:The first edition of the workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) took place in July 2024, co-located with the ACM SIGIR Conference 2024 in the USA (SIGIR 2024). The aim was to bring information retrieval researchers together around the topic of LLMs for evaluation in information retrieval that gathered attention with the advancement of large language models and generative AI. Given the novelty of the topic, the workshop was focused around multi-sided discussions, namely panels and poster sessions of the accepted proceedings papers.

* LLM4Eval Workshop Report

Via

Access Paper or Ask Questions

Large language models can accurately predict searcher preferences

Sep 19, 2023

Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra

Abstract:Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. To measure agreement with real searchers needs high-quality ``gold'' labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

Via

Access Paper or Ask Questions

Taking Search to Task

Jan 12, 2023

Chirag Shah, Ryen W. White, Paul Thomas, Bhaskar Mitra, Shawon Sarkar, Nicholas Belkin

Abstract:The importance of tasks in information retrieval (IR) has been long argued for, addressed in different ways, often ignored, and frequently revisited. For decades, scholars made a case for the role that a user's task plays in how and why that user engages in search and what a search system should do to assist. But for the most part, the IR community has been too focused on query processing and assuming a search task to be a collection of user queries, often ignoring if or how such an assumption addresses the users accomplishing their tasks. With emerging areas of conversational agents and proactive IR, understanding and addressing users' tasks has become more important than ever before. In this paper, we provide various perspectives on where the state-of-the-art is with regard to tasks in IR, what are some of the bottlenecks in deriving and using task information, and how do we go forward from here. In addition to covering relevant literature, the paper provides a synthesis of historical and current perspectives on understanding, extracting, and addressing task-focused search. To ground ongoing and future research in this area, we present a new framing device for tasks using a tree-like structure and various moves on that structure that allow different interpretations and applications. Presented as a combination of synthesis of ideas and past works, proposals for future research, and our perspectives on technical, social, and ethical considerations, this paper is meant to help revitalize the interest and future work in task-based IR.

Via

Access Paper or Ask Questions

Analysing Mixed Initiatives and Search Strategies during Conversational Search

Sep 13, 2021

Mohammad Aliannejadi, Leif Azzopardi, Hamed Zamani, Evangelos Kanoulas, Paul Thomas, Nick Craswel

Figure 1 for Analysing Mixed Initiatives and Search Strategies during Conversational Search

Figure 2 for Analysing Mixed Initiatives and Search Strategies during Conversational Search

Figure 3 for Analysing Mixed Initiatives and Search Strategies during Conversational Search

Figure 4 for Analysing Mixed Initiatives and Search Strategies during Conversational Search

Abstract:Information seeking conversations between users and Conversational Search Agents (CSAs) consist of multiple turns of interaction. While users initiate a search session, ideally a CSA should sometimes take the lead in the conversation by obtaining feedback from the user by offering query suggestions or asking for query clarifications i.e. mixed initiative. This creates the potential for more engaging conversational searches, but substantially increases the complexity of modelling and evaluating such scenarios due to the large interaction space coupled with the trade-offs between the costs and benefits of the different interactions. In this paper, we present a model for conversational search -- from which we instantiate different observed conversational search strategies, where the agent elicits: (i) Feedback-First, or (ii) Feedback-After. Using 49 TREC WebTrack Topics, we performed an analysis comparing how well these different strategies combine with different mixed initiative approaches: (i) Query Suggestions vs. (ii) Query Clarifications. Our analysis reveals that there is no superior or dominant combination, instead it shows that query clarifications are better when asked first, while query suggestions are better when asked after presenting results. We also show that the best strategy and approach depends on the trade-offs between the relative costs between querying and giving feedback, the performance of the initial query, the number of assessments per query, and the total amount of gain required. While this work highlights the complexities and challenges involved in analyzing CSAs, it provides the foundations for evaluating conversational strategies and conversational search agents in batch/offline settings.

* Accepted in CIKM 2021

Via

Access Paper or Ask Questions