Nora Belrose

Understanding Gradient Descent through the Training Jacobian

Dec 09, 2024

Refusal in LLMs is an Affine Function

Nov 13, 2024

Automatically Interpreting Millions of Features in Large Language Models

Oct 17, 2024

Balancing Label Quantity and Quality for Scalable Elicitation

Oct 17, 2024

Does Transformer Interpretability Transfer to RNNs?

Apr 09, 2024

Neural Networks Learn Statistics of Increasing Complexity

Feb 13, 2024

Eliciting Latent Knowledge from Quirky Language Models

Dec 02, 2023

LEACE: Perfect linear concept erasure in closed form

Jun 23, 2023

Eliciting Latent Predictions from Transformers with the Tuned Lens

Mar 15, 2023

imitation: Clean Imitation Learning Implementations

Nov 22, 2022