Alexander Havrilla

ARB: Advanced Reasoning Benchmark for Large Language Models

Jul 28, 2023

Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.

* Submitted to NeurIPS Datasets and Benchmarks Track
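
The rubric-based evaluation mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the rubric format, so the point-weighted checklist below, the `RubricItem` class, and the `score_solution` function are all hypothetical names chosen for illustration, not the authors' implementation. The per-item verdicts are assumed to come from a grader (a model or a human) and are simply passed in as booleans.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One gradable step of a solution (hypothetical structure)."""
    description: str
    points: int


def score_solution(rubric, awarded):
    """Return the fraction of rubric points earned, between 0 and 1.

    rubric:  list of RubricItem
    awarded: list of bool, one verdict per rubric item (assumed to be
             parsed from a grader's output; the parsing is not shown)
    """
    if len(rubric) != len(awarded):
        raise ValueError("one verdict is needed per rubric item")
    total = sum(item.points for item in rubric)
    if total == 0:
        raise ValueError("rubric must carry at least one point")
    earned = sum(item.points for item, ok in zip(rubric, awarded) if ok)
    return earned / total


# Illustrative rubric for a physics problem (made-up items and weights).
rubric = [
    RubricItem("Sets up the correct integral", 4),
    RubricItem("Evaluates the integral correctly", 4),
    RubricItem("States the final answer with units", 2),
]
score = score_solution(rubric, [True, False, True])
```

Scoring intermediate steps separately, rather than only the final answer, is what lets partial credit be compared between a model grader and human annotators.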

Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

Oct 14, 2022

Louis Castricato, Alexander Havrilla, Shahbuland Matiana, Michael Pieler, Anbang Ye, Ian Yang, Spencer Frazier, Mark Riedl

Abstract: Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences. Existing methods to control for story preference utilize prompt engineering, which is labor-intensive and often inconsistent. They may also use logit-manipulation methods, which require annotated datasets to exist for the desired attributes. To address these issues, we first train a contrastive bi-encoder model to align stories with corresponding human critiques, named CARP, building a general purpose preference model. This is subsequently used as a reward function to fine-tune a generative language model via reinforcement learning. However, simply fine-tuning a generative language model with a contrastive reward model does not always reliably result in a story generation system capable of generating stories that meet user preferences. To increase story generation robustness, we further fine-tune the contrastive reward model using a prompt-learning technique. A human participant study is then conducted comparing generations from our full system, ablations, and two baselines. We show that the full fine-tuning pipeline results in a story generator preferred over an LLM 20x as large as well as logit-based methods. This motivates the use of contrastive learning for general purpose human preference modeling.
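
To make the idea of a contrastive bi-encoder used as a reward function concrete, here is a minimal sketch in plain Python. It assumes story and critique embeddings are already available as lists of floats (the encoders themselves are not shown), and it uses a CLIP-style symmetric cross-entropy loss over cosine similarities. That loss choice, the `temperature` value, and the function names `cosine`, `contrastive_loss`, and `reward` are assumptions for illustration; the abstract does not describe CARP's exact objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _cross_entropy_on_diagonal(rows):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(rows)


def contrastive_loss(story_embs, critique_embs, temperature=0.1):
    """Symmetric contrastive loss over matched (story, critique) pairs.

    story_embs[i] is assumed to be paired with critique_embs[i]; every
    other critique in the batch acts as a negative for story i.
    """
    logits = [[cosine(s, c) / temperature for c in critique_embs]
              for s in story_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_on_diagonal(logits)
                  + _cross_entropy_on_diagonal(transposed))


def reward(story_emb, critique_emb):
    """Scalar reward: how well a generated story matches a target critique."""
    return cosine(story_emb, critique_emb)


# Toy 2-D embeddings: matched pairs point in the same direction.
stories = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = contrastive_loss(stories, aligned)
swapped_loss = contrastive_loss(stories, swapped)
```

In this toy batch the loss is lower when each story is paired with its own critique than when the pairs are swapped. Once such a model is trained, a similarity score like `reward` could be passed to a reinforcement learning algorithm as the signal for fine-tuning the generator.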
