Arthur Conmy

Improving Steering Vectors by Targeting Sparse Autoencoder Features

Nov 04, 2024
Via arXiv

Applying sparse autoencoders to unlearn knowledge in language models

Oct 25, 2024
Via arXiv

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Aug 09, 2024
Via arXiv

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Jul 19, 2024
Via arXiv

Interpreting Attention Layer Outputs with Sparse Autoencoders

Jun 25, 2024
Via arXiv

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024
Via arXiv

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Dec 14, 2023
Via arXiv

Attribution Patching Outperforms Automated Circuit Discovery

Oct 16, 2023
Via arXiv

Copy Suppression: Comprehensively Understanding an Attention Head

Oct 06, 2023
Via arXiv

Towards Automated Circuit Discovery for Mechanistic Interpretability

Apr 28, 2023
Via arXiv