Lee Sharkey

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Oct 15, 2024
Via arXiv

Bilinear MLPs enable weight-based mechanistic interpretability

Oct 10, 2024
Via arXiv

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024
Via arXiv

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024
Via arXiv

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sep 19, 2023
Via arXiv

A technical note on bilinear layers for interpretability

May 05, 2023
Via arXiv

Circumventing interpretability: How to defeat mind-readers

Dec 21, 2022
Via arXiv

Interpreting Neural Networks through the Polytope Lens

Nov 22, 2022
Via arXiv

Objective Robustness in Deep Reinforcement Learning

Jun 08, 2021
Via arXiv