Yeu-Tong Lau

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025
Via arXiv

Applying sparse autoencoders to unlearn knowledge in language models

Oct 25, 2024
Via arXiv

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

Oct 14, 2023
Via arXiv