Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Philipp Schaer

TH Köln - University of Applied Sciences

Second SIGIR Workshop on Simulations for Information Access (Sim4IA 2025)

May 16, 2025

Philipp Schaer, Christin Katharina Kreutz, Krisztian Balog, Timo Breuer, Andreas Konstantin Kruff

Abstract:Simulations in information access (IA) have recently gained interest, as shown by various tutorials and workshops around that topic. Simulations can be key contributors to central IA research and evaluation questions, especially around interactive settings when real users are unavailable, or their participation is impossible due to ethical reasons. In addition, simulations in IA can help contribute to a better understanding of users, reduce complexity of evaluation experiments, and improve reproducibility. Building on recent developments in methods and toolkits, the second iteration of our Sim4IA workshop aims to again bring together researchers and practitioners to form an interactive and engaging forum for discussions on the future perspectives of the field. An additional aim is to plan an upcoming TREC/CLEF campaign.

* Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13--18, 2025, Padua, Italy

Via

Access Paper or Ask Questions

Evaluating Contrastive Feedback for Effective User Simulations

May 05, 2025

Andreas Konstantin Kruff, Timo Breuer, Philipp Schaer

Figure 1 for Evaluating Contrastive Feedback for Effective User Simulations

Figure 2 for Evaluating Contrastive Feedback for Effective User Simulations

Abstract:The use of Large Language Models (LLMs) for simulating user behavior in the domain of Interactive Information Retrieval has recently gained significant popularity. However, their application and capabilities remain highly debated and understudied. This study explores whether the underlying principles of contrastive training techniques, which have been effective for fine-tuning LLMs, can also be applied beneficially in the area of prompt engineering for user simulations. Previous research has shown that LLMs possess comprehensive world knowledge, which can be leveraged to provide accurate estimates of relevant documents. This study attempts to simulate a knowledge state by enhancing the model with additional implicit contextual information gained during the simulation. This approach enables the model to refine the scope of desired documents further. The primary objective of this study is to analyze how different modalities of contextual information influence the effectiveness of user simulations. Various user configurations were tested, where models are provided with summaries of already judged relevant, irrelevant, or both types of documents in a contrastive manner. The focus of this study is the assessment of the impact of the prompting techniques on the simulated user agent performance. We hereby lay the foundations for leveraging LLMs as part of more realistic simulated users.

Via

Access Paper or Ask Questions

REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources

Apr 10, 2025

Björn Engelmann, Fabian Haak, Philipp Schaer, Mani Erfanian Abdoust, Linus Netze, Meik Bittkowski

Figure 1 for REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources

Figure 2 for REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources

Figure 3 for REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources

Figure 4 for REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources

Abstract:Retrieval test collections are essential for evaluating information retrieval systems, yet they often lack generalizability across tasks. To overcome this limitation, we introduce REANIMATOR, a versatile framework designed to enable the repurposing of existing test collections by enriching them with extracted and synthetic resources. REANIMATOR enhances test collections from PDF files by parsing full texts and machine-readable tables, as well as related contextual information. It then employs state-of-the-art large language models to produce synthetic relevance labels. Including an optional human-in-the-loop step can help validate the resources that have been extracted and generated. We demonstrate its potential with a revitalized version of the TREC-COVID test collection, showcasing the development of a retrieval-augmented generation system and evaluating the impact of tables on retrieval-augmented generation. REANIMATOR enables the reuse of test collections for new applications, lowering costs and broadening the utility of legacy resources.

Via

Access Paper or Ask Questions

LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance

Mar 11, 2025

Matteo Cancellieri, Alaa El-Ebshihy, Tobias Fink, Petra Galuščáková, Gabriela Gonzalez-Saez, Lorraine Goeuriot, David Iommi, Jüri Keller, Petr Knoth, Philippe Mulhem(+3 more)

Figure 1 for LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance

Abstract:This paper presents the third edition of the LongEval Lab, part of the CLEF 2025 conference, which continues to explore the challenges of temporal persistence in Information Retrieval (IR). The lab features two tasks designed to provide researchers with test data that reflect the evolving nature of user queries and document relevance over time. By evaluating how model performance degrades as test data diverge temporally from training data, LongEval seeks to advance the understanding of temporal dynamics in IR systems. The 2025 edition aims to engage the IR and NLP communities in addressing the development of adaptive models that can maintain retrieval quality over time in the domains of web search and scientific retrieval.

* Accepted for ECIR 2025. To be published in Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings

Via

Access Paper or Ask Questions

Counterfactual Query Rewriting to Use Historical Relevance Feedback

Feb 06, 2025

Jüri Keller, Maik Fröbe, Gijs Hendriksen, Daria Alexander, Martin Potthast, Matthias Hagen, Philipp Schaer

Abstract:When a retrieval system receives a query it has encountered before, previous relevance feedback, such as clicks or explicit judgments can help to improve retrieval results. However, the content of a previously relevant document may have changed, or the document might not be available anymore. Despite this evolved corpus, we counterfactually use these previously relevant documents as relevance signals. In this paper we proposed approaches to rewrite user queries and compare them against a system that directly uses the previous qrels for the ranking. We expand queries with terms extracted from the previously relevant documents or derive so-called keyqueries that rank the previously relevant documents to the top of the current corpus. Our evaluation in the CLEF LongEval scenario shows that rewriting queries with historical relevance feedback improves the retrieval effectiveness and even outperforms computationally expensive transformer-based approaches.

Via

Access Paper or Ask Questions

Investigating Bias in Political Search Query Suggestions by Relative Comparison with LLMs

Oct 31, 2024

Fabian Haak, Björn Engelmann, Christin Katharina Kreutz, Philipp Schaer

Figure 1 for Investigating Bias in Political Search Query Suggestions by Relative Comparison with LLMs

Figure 2 for Investigating Bias in Political Search Query Suggestions by Relative Comparison with LLMs

Figure 3 for Investigating Bias in Political Search Query Suggestions by Relative Comparison with LLMs

Abstract:Search query suggestions affect users' interactions with search engines, which then influences the information they encounter. Thus, bias in search query suggestions can lead to exposure to biased search results and can impact opinion formation. This is especially critical in the political domain. Detecting and quantifying bias in web search engines is difficult due to its topic dependency, complexity, and subjectivity. The lack of context and phrasality of query suggestions emphasizes this problem. In a multi-step approach, we combine the benefits of large language models, pairwise comparison, and Elo-based scoring to identify and quantify bias in English search query suggestions. We apply our approach to the U.S. political news domain and compare bias in Google and Bing.

Via

Access Paper or Ask Questions

Report on the Workshop on Simulations for Information Access (Sim4IA 2024) at SIGIR 2024

Sep 26, 2024

Timo Breuer, Christin Katharina Kreutz, Norbert Fuhr, Krisztian Balog, Philipp Schaer, Nolwenn Bernard, Ingo Frommholz, Marcel Gohsen, Kaixin Ji, Gareth J. F. Jones(+8 more)

Abstract:This paper is a report of the Workshop on Simulations for Information Access (Sim4IA) workshop at SIGIR 2024. The workshop had two keynotes, a panel discussion, nine lightning talks, and two breakout sessions. Key takeaways were user simulation's importance in academia and industry, the possible bridging of online and offline evaluation, and the issues of organizing a companion shared task around user simulations for information access. We report on how we organized the workshop, provide a brief overview of what happened at the workshop, and summarize the main topics and findings of the workshop and future work.

* Preprint of a SIGIR Forum submission for Vol. 58 No. 2 - December 2024

Via

Access Paper or Ask Questions

Replicability Measures for Longitudinal Information Retrieval Evaluation

Sep 09, 2024

Jüri Keller, Timo Breuer, Philipp Schaer

Abstract:Information Retrieval (IR) systems are exposed to constant changes in most components. Documents are created, updated, or deleted, the information needs are changing, and even relevance might not be static. While it is generally expected that the IR systems retain a consistent utility for the users, test collection evaluations rely on a fixed experimental setup. Based on the LongEval shared task and test collection, this work explores how the effectiveness measured in evolving experiments can be assessed. Specifically, the persistency of effectiveness is investigated as a replicability task. It is observed how the effectiveness progressively deteriorates over time compared to the initial measurement. Employing adapted replicability measures provides further insight into the persistence of effectiveness. The ranking of systems varies across retrieval measures and time. In conclusion, it was found that the most effective systems are not necessarily the ones with the most persistent performance.

* Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings. arXiv admin note: text overlap with arXiv:2308.10549

Via

Access Paper or Ask Questions

The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Jul 12, 2024

Anh Thu Maria Bui, Saskia Felizitas Brech, Natalie Hußfeldt, Tobias Jennert, Melanie Ullrich, Timo Breuer, Narjes Nikzad Khasmakhi, Philipp Schaer

Figure 1 for The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Figure 2 for The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Figure 3 for The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Figure 4 for The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Abstract:Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.

* Paper accepted at ELOQUENT@CLEF'24

Via

Access Paper or Ask Questions

Evaluation of Temporal Change in IR Test Collections

Jul 01, 2024

Jüri Keller, Timo Breuer, Philipp Schaer

Figure 1 for Evaluation of Temporal Change in IR Test Collections

Figure 2 for Evaluation of Temporal Change in IR Test Collections

Figure 3 for Evaluation of Temporal Change in IR Test Collections

Figure 4 for Evaluation of Temporal Change in IR Test Collections

Abstract:Information retrieval systems have been evaluated using the Cranfield paradigm for many years. This paradigm allows a systematic, fair, and reproducible evaluation of different retrieval methods in fixed experimental environments. However, real-world retrieval systems must cope with dynamic environments and temporal changes that affect the document collection, topical trends, and the individual user's perception of what is considered relevant. Yet, the temporal dimension in IR evaluations is still understudied. To this end, this work investigates how the temporal generalizability of effectiveness evaluations can be assessed. As a conceptual model, we generalize Cranfield-type experiments to the temporal context by classifying the change in the essential components according to the create, update, and delete operations of persistent storage known from CRUD. From the different types of change different evaluation scenarios are derived and it is outlined what they imply. Based on these scenarios, renowned state-of-the-art retrieval systems are tested and it is investigated how the retrieval effectiveness changes on different levels of granularity. We show that the proposed measures can be well adapted to describe the changes in the retrieval results. The experiments conducted confirm that the retrieval effectiveness strongly depends on the evaluation scenario investigated. We find that not only the average retrieval performance of single systems but also the relative system performance are strongly affected by the components that change and to what extent these components changed.

* Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington, DC, USA

Via

Access Paper or Ask Questions