
Jeffrey Ladish

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Jun 03, 2024
Via arXiv

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023
Via arXiv

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023
Via arXiv

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022
Via arXiv

Measuring Progress on Scalable Oversight for Large Language Models

Nov 11, 2022
Via arXiv