Atticus Geiger

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Jan 29, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Jan 14, 2025

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Sep 05, 2024

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

Aug 20, 2024

Updating CLIP to Prefer Descriptions Over Captions

Jun 12, 2024

ReFT: Representation Finetuning for Language Models

Apr 08, 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Mar 12, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Feb 27, 2024

A Reply to Makelov et al.'s "Interpretability Illusion" Arguments

Jan 23, 2024