Yeu-Tong Lau

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde (+4 more)

Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at https://saebench.xyz

Applying sparse autoencoders to unlearn knowledge in language models

Oct 25, 2024

Eoin Farrell, Yeu-Tong Lau, Arthur Conmy

Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn biology-related knowledge with minimal side effects. Our results suggest that negative scaling of feature activations is necessary and that zero-ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the existing fine-tuning-based techniques.
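
The contrast the abstract draws between zero-ablating a feature and negatively scaling it can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the three-dimensional vectors, the single-feature "SAE", and the helper names (`feature_activation`, `intervene`) are all hypothetical, and the intervention shown (replacing the feature's contribution with a scaled copy) is only one plausible reading of "negative scaling".

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def feature_activation(x, enc_dir, bias):
    """ReLU activation of a single (hypothetical) SAE feature."""
    return max(0.0, dot(enc_dir, x) + bias)


def intervene(x, enc_dir, dec_dir, bias, scale):
    """Replace the feature's contribution a * dec_dir with (scale * a) * dec_dir.

    scale = 1.0 leaves x unchanged, scale = 0.0 is zero ablation,
    and scale < 0 is negative scaling.
    """
    a = feature_activation(x, enc_dir, bias)
    delta = (scale - 1.0) * a
    return [xi + delta * di for xi, di in zip(x, dec_dir)]


# Toy activation vector and a feature aligned with the first axis (assumed values).
x = [2.0, 1.0, 0.0]
enc_dir = [1.0, 0.0, 0.0]
dec_dir = [1.0, 0.0, 0.0]
bias = 0.0

zero_ablated = intervene(x, enc_dir, dec_dir, bias, scale=0.0)
neg_scaled = intervene(x, enc_dir, dec_dir, bias, scale=-2.0)

print("zero-ablated:", zero_ablated)    # component along dec_dir removed
print("negatively scaled:", neg_scaled)  # component along dec_dir pushed negative
```

In this toy setting, zero ablation only removes the feature's component from the activation, whereas negative scaling pushes the activation in the opposite direction along the decoder vector, which is a stronger intervention.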

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

Oct 14, 2023

James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

Abstract: We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
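
To make the clean-up concern concrete, the following toy sketch shows how a per-component attribution can mislead when a later component erases an earlier one's output. It is an illustration under simplifying assumptions, not the paper's setup: the component names, the two-dimensional residual stream, and the unembedding direction are hypothetical, and layer normalization is ignored.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical unembedding direction for one output token.
unembed = [1.0, 0.0]

# Hypothetical outputs that three components add to the residual stream.
outputs = {
    "writer": [3.0, 1.0],     # writes information early in the forward pass
    "cleaner": [-3.0, -1.0],  # later removes exactly what the writer wrote
    "other": [0.5, 0.0],      # an unrelated component
}

# Per-component attribution: project each component's output onto the
# unembedding direction, as if each contribution reached the logits untouched.
dla = {name: dot(out, unembed) for name, out in outputs.items()}

# The actual logit depends on the summed residual stream.
residual = [sum(col) for col in zip(*outputs.values())]
logit = dot(residual, unembed)

for name, value in dla.items():
    print(f"{name}: {value:+.1f}")
print(f"final logit: {logit:+.1f}")
```

Here the writer receives a large positive attribution and the cleaner an equally large negative one, yet together they contribute nothing to the final logit. Reading either attribution in isolation would overstate that component's effect on the prediction.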
