Lee Sharkey

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Feb 07, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Jan 24, 2025

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Oct 15, 2024

Bilinear MLPs enable weight-based mechanistic interpretability

Oct 10, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sep 19, 2023

A technical note on bilinear layers for interpretability

May 05, 2023

Circumventing interpretability: How to defeat mind-readers

Dec 21, 2022