Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Georg Lange

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Dec 06, 2023

Aleksandar Makelov, Georg Lange, Neel Nanda

Figure 1 for Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Figure 2 for Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Figure 3 for Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Figure 4 for Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Abstract:Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. We explore the additional evidence needed to argue that a patched subspace is faithful.

* NeurIPS 2023 Workshop on Attributing Model Behavior at Scale

Via

Access Paper or Ask Questions