Title: Provable Guarantees for Model Performance via Mechanistic Interpretability

Jun 18, 2024

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

Abstract: In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
