Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chris Samarinas

Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

Jan 07, 2025

Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, Hamed Zamani

Figure 1 for Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

Figure 2 for Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

Figure 3 for Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

Figure 4 for Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

Abstract:This paper presents ICAT, an evaluation framework for measuring coverage of diverse factual information in long-form text generation. ICAT breaks down a long output text into a list of atomic claims and not only verifies each claim through retrieval from a (reliable) knowledge source, but also computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. We study three implementations of the ICAT framework, each with a different assumption on the availability of aspects and alignment method. By adopting data from the diversification task in the TREC Web Track and the ClueWeb corpus, we evaluate the ICAT framework. We demonstrate strong correlation with human judgments and provide comprehensive evaluation across multiple state-of-the-art LLMs. Our framework further offers interpretable and fine-grained analysis of diversity and coverage. Its modular design allows for easy adaptation to different domains and datasets, making it a valuable tool for evaluating the qualitative aspects of long-form responses produced by LLMs.

Via

Access Paper or Ask Questions

ProCIS: A Benchmark for Proactive Retrieval in Conversations

May 10, 2024

Chris Samarinas, Hamed Zamani

Figure 1 for ProCIS: A Benchmark for Proactive Retrieval in Conversations

Figure 2 for ProCIS: A Benchmark for Proactive Retrieval in Conversations

Figure 3 for ProCIS: A Benchmark for Proactive Retrieval in Conversations

Figure 4 for ProCIS: A Benchmark for Proactive Retrieval in Conversations

Abstract:The field of conversational information seeking, which is rapidly gaining interest in both academia and industry, is changing how we interact with search engines through natural language interactions. Existing datasets and methods are mostly evaluating reactive conversational information seeking systems that solely provide response to every query from the user. We identify a gap in building and evaluating proactive conversational information seeking systems that can monitor a multi-party human conversation and proactively engage in the conversation at an opportune moment by retrieving useful resources and suggestions. In this paper, we introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations. We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments through depth-k pooling. We also collect annotations related to the parts of the conversation that are related to each document, enabling us to evaluate proactive retrieval systems. We introduce normalized proactive discounted cumulative gain (npDCG) for evaluating these systems, and further provide benchmark results for a wide range of models, including a novel model we developed for this task. We believe that the developed dataset, called ProCIS, paves the path towards developing proactive conversational information seeking systems.

Via

Access Paper or Ask Questions

Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models

Apr 23, 2024

Chris Samarinas, Pracha Promthaw, Atharva Nijasure, Hansi Zeng, Julian Killingback, Hamed Zamani

Abstract:This paper explores SynTOD, a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) Systems capable of handling complex tasks such as intent classification, slot filling, conversational question-answering, and retrieval-augmented response generation, without relying on crowdsourcing or real-world data. SynTOD utilizes a state transition graph to define the desired behavior of a TOD system and generates diverse, structured conversations through random walks and response simulation using large language models (LLMs). In our experiments, using graph-guided response simulations leads to significant improvements in intent classification, slot filling and response relevance compared to naive single-prompt simulated conversations. We also investigate the end-to-end TOD effectiveness of different base and instruction-tuned LLMs, with and without the constructed synthetic conversations. Finally, we explore how various LLMs can evaluate responses in a TOD system and how well they are correlated with human judgments. Our findings pave the path towards quick development and evaluation of domain-specific TOD systems. We release our datasets, models, and code for research purposes.

Via

Access Paper or Ask Questions