Lucy Farnik

Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Feb 25, 2025
Via arXiv

Sparse Autoencoders Can Interpret Randomly Initialized Transformers

Jan 29, 2025
Via arXiv

Inducing Human-like Biases in Moral Reasoning Language Models

Nov 23, 2024
Via arXiv

Residual Stream Analysis with Multi-Layer SAEs

Sep 06, 2024
Via arXiv

STARC: A General Framework For Quantifying Differences Between Reward Functions

Sep 26, 2023
Via arXiv