Title: ALMANACS: A Simulatability Benchmark for Language Model Explainability

Dec 20, 2023

Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons

Abstract: How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. We use ALMANACS to evaluate counterfactuals, rationalizations, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.

* Code is available at https://github.com/edmundmills/ALMANACS
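
To make the comparison against the explanation-free control concrete, here is a minimal sketch of how a simulatability gain could be computed. It is an illustration under stated assumptions, not the benchmark's implementation: the function names, the use of mean absolute error as the scoring metric, and the toy probabilities are all hypothetical and do not come from the abstract or the linked repository.

```python
from statistics import mean


def simulatability_error(predicted, actual):
    """Mean absolute error between a predictor's guesses and the model's actual behavior.

    Both arguments are sequences of probabilities on held-out inputs.
    The choice of absolute error is an assumption made for illustration.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def explanation_gain(actual, with_explanations, control):
    """Positive when predictions made with explanations beat the explanation-free control."""
    return simulatability_error(control, actual) - simulatability_error(
        with_explanations, actual
    )


# Made-up numbers, chosen only to show the shape of the comparison.
actual = [0.9, 0.2, 0.6]       # behavior of the model being explained
control = [0.8, 0.3, 0.5]      # predictor's guesses without explanations
explained = [0.7, 0.4, 0.5]    # predictor's guesses with explanations

gain = explanation_gain(actual, explained, control)
print(f"gain from explanations: {gain:+.3f}")
```

In this toy case the gain is negative, meaning the hypothetical explanations made predictions worse than the control; a useful explanation method would produce a positive gain.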
