Jeffrey Ladish

Demonstrating specification gaming in reasoning models

Feb 18, 2025
Via arXiv

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Jun 03, 2024
Via arXiv

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023
Via arXiv

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023
Via arXiv

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022
Via arXiv

Measuring Progress on Scalable Oversight for Large Language Models

Nov 11, 2022
Via arXiv