Simon Lermen

Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

Apr 10, 2025

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Oct 08, 2024

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

Dec 08, 2023

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023