Aleksandar Makelov

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

May 16, 2024
Via arXiv

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Dec 06, 2023
Via arXiv

Rethinking Backdoor Attacks

Jul 19, 2023
Via arXiv

Towards Deep Learning Models Resistant to Adversarial Attacks

Nov 09, 2017
Via arXiv