Charlie George

Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers

Oct 16, 2023

Charlie George, Andreas Stuhlmüller

Abstract: Hallucination plagues even frontier LLMs--but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.

* Second Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023
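
To put the before and after figures quoted in the abstract on a common scale, the relative reduction per model can be worked out directly. The sketch below is only arithmetic on the numbers reported above; the percentages it prints are derived here, not stated by the authors, and it assumes the "ChatGPT" figure after self-correction refers to the same ChatGPT (16k) model. The helper name `relative_reduction` is illustrative.

```python
# Average hallucinations per summary, as quoted in the abstract above.
before = {"ChatGPT (16k)": 0.62, "GPT-4": 0.84, "Claude 2": 1.55}
# After self-correction with Factored Critiques (assumed to be the same models).
after = {"ChatGPT (16k)": 0.49, "GPT-4": 0.46, "Claude 2": 0.95}


def relative_reduction(old: float, new: float) -> float:
    """Fraction of hallucinations removed, relative to the starting rate."""
    return (old - new) / old


# Derived values, not figures reported in the paper.
reductions = {m: relative_reduction(before[m], after[m]) for m in before}

for model, r in reductions.items():
    print(f"{model}: {before[model]:.2f} -> {after[model]:.2f} ({r:.0%} fewer)")
```

On these numbers the drop is roughly 21% for ChatGPT, 45% for GPT-4, and 39% for Claude 2, so every model still averages close to half a hallucination or more per summary after self-correction.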

Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes

Jan 05, 2023

Justin Reppert, Ben Rachbach, Charlie George, Luke Stebbing, Jungwon Byun, Maggie Appleton, Andreas Stuhlmüller

Abstract: Language models (LMs) can perform complex reasoning either end-to-end, with hidden latent state, or compositionally, with transparent intermediate state. Composition offers benefits for interpretability and safety, but may need workflow support and infrastructure to remain competitive. We describe iterated decomposition, a human-in-the-loop workflow for developing and refining compositional LM programs. We improve the performance of compositions by zooming in on failing components and refining them through decomposition, additional context, chain of thought, etc. To support this workflow, we develop ICE, an open-source tool for visualizing the execution traces of LM programs. We apply iterated decomposition to three real-world tasks and improve the accuracy of LM programs over less compositional baselines: describing the placebo used in a randomized controlled trial (25% to 65%), evaluating participant adherence to a medical intervention (53% to 70%), and answering NLP questions on the Qasper dataset (38% to 69%). These applications serve as case studies for a workflow that, if automated, could keep ML systems interpretable and safe even as they scale to increasingly complex tasks.
