Title: Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

Abstract: In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
