Logan Smith

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024
Via arXiv

Eliciting Latent Predictions from Transformers with the Tuned Lens

Mar 15, 2023
Via arXiv

Researching Alignment Research: Unsupervised Analysis

Jun 06, 2022
Via arXiv