Daniel J. Lee

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Oct 16, 2024

Daniel J. Lee, Stefan Heimersheim

[Figures 1-4 of the paper; images not included]

Abstract: Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with feature directions from lower-L0 SAEs exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs compared to traditional SAEs.
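
To make the measurement described in the abstract concrete, here is a minimal sketch of a sensitive-directions style experiment. It is not the authors' code: a random linear readout (`toy_logits`) stands in for the remaining layers of GPT-2, the perturbation direction is random rather than an SAE feature or reconstruction error, and the choice of KL(original || perturbed) as well as all names and sizes are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_logits(activation, unembed):
    """Hypothetical stand-in for the rest of the model: a linear readout."""
    return [sum(w * a for w, a in zip(row, activation)) for row in unembed]

def perturb(activation, direction, scale):
    """Move the activation a distance `scale` along the unit-normalised direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [a + scale * d / norm for a, d in zip(activation, direction)]

def sensitivity(activation, direction, scale, unembed):
    """KL between next-token distributions before and after the perturbation."""
    base = softmax(toy_logits(activation, unembed))
    moved = softmax(toy_logits(perturb(activation, direction, scale), unembed))
    return kl_divergence(base, moved)

# Assumed toy sizes and random data, chosen only for illustration.
rng = random.Random(0)
d_model, vocab = 8, 5
unembed = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]
activation = [rng.gauss(0, 1) for _ in range(d_model)]
direction = [rng.gauss(0, 1) for _ in range(d_model)]

for scale in (0.0, 0.5, 1.0, 2.0):
    print(scale, sensitivity(activation, direction, scale, unembed))
```

In a real experiment the same loop would be run for several kinds of directions (for example SAE feature directions, SAE reconstruction errors, and baseline directions) at matched perturbation lengths, and the resulting KL curves compared.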
