Title: Residual Stream Analysis with Multi-Layer SAEs

Sep 06, 2024

Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

Abstract: Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.
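
The two quantities the abstract mentions (cosine similarity between adjacent layers of the residual stream, and the set of layers at which a shared SAE feature is active for one prompt) can be illustrated with a minimal sketch. This is not the code released at the repository above; the function names, the toy data, and the "activation greater than a threshold" definition of an active feature are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacent_layer_similarities(residual):
    """residual[layer] is the residual stream vector for one token at that layer.

    Returns one cosine similarity per pair of adjacent layers.
    """
    return [
        cosine_similarity(residual[i], residual[i + 1])
        for i in range(len(residual) - 1)
    ]


def active_layers_per_feature(latents, threshold=0.0):
    """latents[layer][feature] is a feature activation from one shared SAE.

    Returns {feature index: [layers where activation > threshold]}.
    The threshold-based notion of "active" is an assumption of this sketch.
    """
    n_features = len(latents[0])
    return {
        f: [layer for layer, row in enumerate(latents) if row[f] > threshold]
        for f in range(n_features)
    }


if __name__ == "__main__":
    # Toy data (made up): 3 layers, model dimension 2, 3 SAE features.
    residual = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    latents = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.2], [0.0, 0.0, 0.7]]
    print(adjacent_layer_similarities(residual))
    print(active_layers_per_feature(latents))
```

Because a single SAE is shared across layers, the same feature index means the same thing at every layer, which is what makes the per-feature layer lists comparable. With separately trained per-layer SAEs, feature indices would not line up this way.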

* 16 pages, 12 figures
