Picture for Neel Nanda

Neel Nanda

Google DeepMind

BatchTopK Sparse Autoencoders

Add code
Dec 09, 2024
Viaarxiv icon

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Add code
Nov 28, 2024
Viaarxiv icon

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Add code
Nov 21, 2024
Viaarxiv icon

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Add code
Aug 09, 2024
Figure 1 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Figure 2 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Figure 3 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Figure 4 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Viaarxiv icon

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Add code
Jul 19, 2024
Figure 1 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Figure 2 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Figure 3 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Figure 4 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Viaarxiv icon

Interpreting Attention Layer Outputs with Sparse Autoencoders

Add code
Jun 25, 2024
Viaarxiv icon

Confidence Regulation Neurons in Language Models

Add code
Jun 24, 2024
Viaarxiv icon

Refusal in Language Models Is Mediated by a Single Direction

Add code
Jun 17, 2024
Viaarxiv icon

Transcoders Find Interpretable LLM Feature Circuits

Add code
Jun 17, 2024
Viaarxiv icon

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Add code
May 16, 2024
Viaarxiv icon