Jake Mendel

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Jan 24, 2025
Via arXiv

Mathematical Models of Computation in Superposition

Aug 10, 2024
Via arXiv

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 17, 2024
Via arXiv

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

May 17, 2024
Via arXiv

Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

Oct 10, 2023
Via arXiv