François Roewer-Després

$\texttt{ACCORD}$: Closing the Commonsense Measurability Gap

Jun 04, 2024

François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz

Figures 1-4 for $\texttt{ACCORD}$: Closing the Commonsense Measurability Gap

Abstract: We present $\texttt{ACCORD}$, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. $\texttt{ACCORD}$ introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, $\texttt{ACCORD}$ can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs, including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1, shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.

* For the leaderboard and dataset download, see https://www.codabench.org/competitions/3160/. For source code, see https://github.com/francois-rd/accord/.
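As a rough illustration of what "degrading to random chance" means operationally, the sketch below groups multiple-choice results by reasoning hop count and compares each group's accuracy with its chance baseline (the mean of 1 / number of choices). The record fields (`hops`, `num_choices`, `correct`) and the toy records are hypothetical, made up for this example; they are not the ACCORD data format, its evaluation code, or its reported results.

```python
from collections import defaultdict


def accuracy_by_hops(records):
    """Return {hops: (accuracy, chance)} for multiple-choice results.

    Each record is a dict with hypothetical keys:
      hops        -- number of reasoning hops in the question
      num_choices -- number of answer options offered
      correct     -- whether the model picked the right option
    """
    # Per hop count: [number correct, number of questions, summed chance]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for r in records:
        bucket = totals[r["hops"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
        bucket[2] += 1.0 / r["num_choices"]
    return {h: (c / n, s / n) for h, (c, n, s) in sorted(totals.items())}


# Made-up toy records, for illustration only.
toy = [
    {"hops": 1, "num_choices": 4, "correct": True},
    {"hops": 1, "num_choices": 4, "correct": True},
    {"hops": 3, "num_choices": 4, "correct": True},
    {"hops": 3, "num_choices": 4, "correct": False},
    {"hops": 3, "num_choices": 4, "correct": False},
    {"hops": 3, "num_choices": 4, "correct": False},
]

result = accuracy_by_hops(toy)
for hops, (acc, chance) in result.items():
    print(f"hops={hops} accuracy={acc:.2f} chance={chance:.2f}")
```

A group whose accuracy sits at or near its chance value is one where the model is doing no better than guessing, which is the pattern the abstract describes as complexity increases.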
