Aidan Ewart

Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Oct 16, 2024
Via arXiv

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Jul 22, 2024
Via arXiv

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024
Via arXiv

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sep 19, 2023
Via arXiv