Nathanael Schärli

Evaluation of retrieval-based QA on QUEST-LOFT

Nov 08, 2025

Nathan Scales, Nathanael Schärli, Olivier Bousquet

Abstract: Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.

Teaching Large Language Models to Self-Debug

Apr 11, 2023

Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou

Abstract: Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on the code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.
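The unit-test variant described in this abstract can be pictured as a generate, execute, feed back loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `run_tests`, `self_debug`, and `toy_model` are hypothetical names, and `toy_model` is a hard-coded stand-in for a language model call.

```python
from typing import Callable, List, Tuple

def run_tests(source: str, tests: List[Tuple[tuple, object]], fn_name: str) -> str:
    """Execute candidate code and return a feedback message ('' means all tests pass)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only safe here because the toy source is trusted
        fn = namespace[fn_name]
        for args, expected in tests:
            got = fn(*args)
            if got != expected:
                return f"{fn_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return ""

def self_debug(
    generate: Callable[[str, str, str], str],
    task: str,
    tests: List[Tuple[tuple, object]],
    fn_name: str,
    max_turns: int = 3,
) -> str:
    """Regenerate the program until the tests pass or the turn budget runs out."""
    source, feedback = "", ""
    for _ in range(max_turns):
        # The generator sees the task, its previous attempt, and the feedback.
        source = generate(task, source, feedback)
        feedback = run_tests(source, tests, fn_name)
        if not feedback:
            break
    return source

def toy_model(task: str, previous: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM: buggy first attempt, fixed once it sees feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

final_program = self_debug(toy_model, "add two numbers", [((1, 2), 3)], "add")
```

In a real setting `generate` would wrap a model prompt containing few-shot debugging demonstrations; the loop structure is the only part this sketch tries to convey.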

Large Language Models Can Be Easily Distracted by Irrelevant Context

Feb 13, 2023

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, Denny Zhou

Abstract: Large language models have achieved impressive performance on various natural language processing tasks. However, so far they have been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this work, we investigate the distractibility of large language models, i.e., how the model problem-solving accuracy can be influenced by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that the model performance is dramatically decreased when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information.
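Self-consistency decoding, one of the mitigations named above, is commonly understood as sampling several answers and keeping the most frequent one. A minimal sketch of that majority vote follows, assuming a hypothetical `sample_answer` callable that stands in for one sampled model completion reduced to its final answer.

```python
from collections import Counter
from typing import Callable, Iterator

def self_consistency(sample_answer: Callable[[str], str], question: str,
                     num_samples: int = 5) -> str:
    """Draw several answers for the same question and return the most common one."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

def make_toy_sampler(outputs: list) -> Callable[[str], str]:
    """Hypothetical sampler that replays fixed outputs instead of calling a model."""
    stream: Iterator[str] = iter(outputs)
    return lambda question: next(stream)

# Two of the five toy samples are thrown off by a distractor; the vote still picks "18".
sampler = make_toy_sampler(["18", "18", "21", "18", "20"])
voted = self_consistency(sampler, "How many apples are left?", num_samples=5)
```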

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Oct 17, 2022

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou (+1 more)

Abstract: BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

* GitHub repository: https://github.com/suzgunmirac/BIG-Bench-Hard

Compositional Semantic Parsing with Large Language Models

Sep 30, 2022

Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, Denny Zhou

Abstract: Humans can reason compositionally when presented with new tasks. Previous research shows that appropriate prompting techniques enable large language models (LLMs) to solve artificial compositional generalization tasks such as SCAN. In this work, we identify additional challenges in more realistic semantic parsing tasks with larger vocabulary and refine these prompting techniques to address them. Our best method is based on least-to-most prompting: it decomposes the problem using prompting-based syntactic parsing, then uses this decomposition to select appropriate exemplars and to sequentially generate the semantic parse. This method allows us to set a new state of the art for CFQ while requiring only 1% of the training data used by traditional approaches. Due to the general nature of our approach, we expect similar efforts will lead to new results in other tasks and domains, especially for knowledge-intensive applications.

* Fixed metadata. No other changes

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

May 21, 2022

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, Ed Chi

Abstract: We propose a novel prompting strategy, least-to-most prompting, that enables large language models to better perform multi-step reasoning tasks. Least-to-most prompting first reduces a complex problem into a list of subproblems, and then sequentially solves the subproblems, whereby solving a given subproblem is facilitated by the model's answers to previously solved subproblems. Experiments on symbolic manipulation, compositional generalization and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin. A notable empirical result is that the GPT-3 code-davinci-002 model with least-to-most prompting can solve the SCAN benchmark with an accuracy of 99.7% using 14 examples. As a comparison, the neural-symbolic models in the literature specialized for solving SCAN are trained with the full training set of more than 15,000 examples.
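The two stages in this abstract (reduce to subproblems, then solve them in order while reusing earlier answers) can be sketched as a small driver loop. This is an illustrative sketch only: `least_to_most`, `toy_decompose`, and `toy_solve` are hypothetical names, and the toy functions replace what would be two prompted model calls, here on a last-letter-concatenation task.

```python
from typing import Callable, List, Tuple

Solved = List[Tuple[str, str]]  # (subproblem, answer) pairs, easiest first

def least_to_most(
    problem: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str, Solved], str],
) -> str:
    """Solve subproblems in order, passing earlier answers to each later step."""
    subproblems = decompose(problem)
    if not subproblems:
        raise ValueError("decomposition produced no subproblems")
    solved: Solved = []
    for sub in subproblems:
        # Each step can condition on every previously solved subproblem.
        solved.append((sub, solve(sub, solved)))
    return solved[-1][1]

def toy_decompose(problem: str) -> List[str]:
    """Stand-in for the decomposition prompt: growing prefixes of the word list."""
    words = problem.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]

def toy_solve(sub: str, solved: Solved) -> str:
    """Stand-in for the solving prompt: extend the previous answer by one letter."""
    previous = solved[-1][1] if solved else ""
    return previous + sub.split()[-1][-1]

result = least_to_most("think machine learning", toy_decompose, toy_solve)  # "keg"
```

The point of the structure is that the final subproblem is answered with one small step on top of an already solved shorter problem, which is what lets the approach extend to inputs longer than those in the prompt.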

*-CFQ: Analyzing the Scalability of Machine Learning on a Compositional Task

Dec 15, 2020

Dmitry Tsarkov, Tibor Tihon, Nathan Scales, Nikola Momchev, Danila Sinopalnikov, Nathanael Schärli

Abstract: We present *-CFQ ("star-CFQ"): a suite of large-scale datasets of varying scope based on the CFQ semantic parsing benchmark, designed for principled investigation of the scalability of machine learning systems in a realistic compositional task setting. Using this suite, we conduct a series of experiments investigating the ability of Transformers to benefit from increased training size under conditions of fixed computational cost. We show that compositional generalization remains a challenge at all training sizes, and we show that increasing the scope of natural language leads to consistently higher error rates, which are only partially offset by increased training data. We further show that while additional training data from a related domain improves the accuracy in data-starved situations, this improvement is limited and diminishes as the distance from the related domain to the target domain increases.

* Accepted, AAAI-21

Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures

Jul 21, 2020

Daniel Furrer, Marc van Zee, Nathan Scales, Nathanael Schärli

Abstract: While mainstream machine learning methods are known to have limited ability to compositionally generalize, new architectures and techniques continue to be proposed to address this limitation. We investigate state-of-the-art techniques and architectures in order to assess their effectiveness in improving compositional generalization in semantic parsing tasks based on the SCAN and CFQ datasets. We show that masked language model (MLM) pre-training rivals SCAN-inspired architectures on primitive holdout splits. On a more complex compositional task, we show that pre-training leads to significant improvements in performance vs. comparable non-pre-trained models, whereas architectures proposed to encourage compositional generalization on SCAN or in the area of algorithm learning fail to lead to significant improvements. We establish a new state of the art on the CFQ compositional generalization benchmark using MLM pre-training together with an intermediate representation.

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

Dec 20, 2019

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon (+4 more)

Abstract: State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.
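The abstract does not spell out how the two divergences are computed. The sketch below assumes a Chernoff-coefficient form, divergence = 1 - sum_k p_k^alpha * q_k^(1 - alpha), with alpha = 0.5 for atoms and alpha = 0.1 for compounds; both the formula and the alpha values are assumptions here, and plain occurrence counts are used where a full treatment would likely weight compounds. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

def to_distribution(items: Iterable[str]) -> Dict[str, float]:
    """Turn a bag of atoms or compounds into relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def divergence(train_items: Iterable[str], test_items: Iterable[str], alpha: float) -> float:
    """Assumed form: 1 minus the Chernoff coefficient of the two frequency distributions."""
    p = to_distribution(train_items)
    q = to_distribution(test_items)
    shared = p.keys() & q.keys()  # items missing from either side contribute zero
    coefficient = sum(p[k] ** alpha * q[k] ** (1 - alpha) for k in shared)
    return 1.0 - coefficient

# Toy split: train and test share every atom but no compound.
train_atoms = ["direct", "produce", "film", "actor"]
test_atoms = ["direct", "produce", "film", "actor"]
train_compounds = ["direct film", "produce film"]
test_compounds = ["direct actor", "produce actor"]

atom_div = divergence(train_atoms, test_atoms, alpha=0.5)              # assumed alpha
compound_div = divergence(train_compounds, test_compounds, alpha=0.1)  # assumed alpha
```

A split with identical atom usage but disjoint compounds is the extreme case of the construction the abstract describes: small atom divergence, large compound divergence.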

* Accepted for publication at ICLR 2020
