Lee Sharkey

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Feb 07, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Jan 24, 2025

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Oct 15, 2024

Bilinear MLPs enable weight-based mechanistic interpretability

Oct 10, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sep 19, 2023

A technical note on bilinear layers for interpretability

May 05, 2023

Circumventing interpretability: How to defeat mind-readers

Dec 21, 2022