Picture for Hoagy Cunningham

Hoagy Cunningham

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Add code
Jan 31, 2025
Viaarxiv icon

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Add code
Sep 19, 2023
Figure 1 for Sparse Autoencoders Find Highly Interpretable Features in Language Models
Figure 2 for Sparse Autoencoders Find Highly Interpretable Features in Language Models
Figure 3 for Sparse Autoencoders Find Highly Interpretable Features in Language Models
Figure 4 for Sparse Autoencoders Find Highly Interpretable Features in Language Models
Viaarxiv icon