Simon Lermen

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Oct 08, 2024
Via arXiv

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

Dec 08, 2023
Via arXiv

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023
Via arXiv

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023
Via arXiv

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023
Via arXiv