Title: InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024

Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train these neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
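
The difference between the two constraints named in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: the `ToyModel` class, its node names (`h_circuit`, `h_other`), the `leak` parameter, and the arithmetic task are all hypothetical, and exact equality checks stand in for the training losses. It only shows the idea that a model can agree with a high-level causal model under interchange interventions on its circuit node while a non-circuit node still influences the output, which is the kind of behaviour a strict check rules out.

```python
from dataclasses import dataclass


def high_level(base, source=None):
    """Hypothetical high-level causal model: S = a + b, output = S * c.
    If `source` is given, the variable S is taken from the source input."""
    a, b, c = base
    s = a + b if source is None else source[0] + source[1]
    return s * c


@dataclass
class ToyModel:
    """Stand-in for a low-level network with two internal nodes.
    `leak` (hypothetical) controls how much the non-circuit node is used."""
    leak: float = 0.0

    def hidden(self, x):
        a, b, c = x
        # h_circuit is aligned with the high-level variable S;
        # h_other is a non-circuit node that happens to copy c.
        return {"h_circuit": a + b, "h_other": c}

    def run(self, base, source=None, patch=()):
        h = self.hidden(base)
        if source is not None:
            src = self.hidden(source)
            for name in patch:  # interchange intervention: swap in source activations
                h[name] = src[name]
        c = base[2]
        return h["h_circuit"] * ((1 - self.leak) * c + self.leak * h["h_other"])


def iit_agrees(model, base, source):
    # Patching the aligned node must match patching S in the high-level model.
    return model.run(base, source, patch=("h_circuit",)) == high_level(base, source)


def strict_ok(model, base, source):
    # Strict check: patching a non-circuit node must leave the output unchanged.
    return model.run(base, source, patch=("h_other",)) == model.run(base)


base, source = (1, 2, 3), (4, 1, 2)
leaky, strict = ToyModel(leak=0.5), ToyModel(leak=0.0)
print(iit_agrees(leaky, base, source), strict_ok(leaky, base, source))
print(iit_agrees(strict, base, source), strict_ok(strict, base, source))
```

In this toy setting the leaky model passes the interchange check but fails the strict one, while the model with `leak=0.0` passes both. An actual training setup would turn these checks into differentiable losses over many sampled base and source inputs; that part is omitted here.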
