Nora Belrose

Examining Two Hop Reasoning Through Information Content Scaling

Feb 05, 2025
Via arXiv

Slowing Learning by Erasing Simple Features

Feb 05, 2025
Via arXiv

Converting MLPs into Polynomials in Closed Form

Feb 03, 2025
Via arXiv

Partially Rewriting a Transformer in Natural Language

Jan 31, 2025
Via arXiv

Transcoders Beat Sparse Autoencoders for Interpretability

Jan 31, 2025
Via arXiv

Estimating the Probability of Sampling a Trained Neural Network at Random

Jan 31, 2025
Via arXiv

Sparse Autoencoders Trained on the Same Data Learn Different Features

Jan 29, 2025
Via arXiv

Understanding Gradient Descent through the Training Jacobian

Dec 09, 2024
Via arXiv

Refusal in LLMs is an Affine Function

Nov 13, 2024
Via arXiv

Balancing Label Quantity and Quality for Scalable Elicitation

Oct 17, 2024
Via arXiv