Picture for Henry Sleight

Henry Sleight

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Add code
Nov 12, 2024
Viaarxiv icon

Looking Inward: Language Models Can Learn About Themselves by Introspection

Add code
Oct 17, 2024
Viaarxiv icon

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Add code
Jul 22, 2024
Viaarxiv icon

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Add code
Jul 21, 2024
Viaarxiv icon

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Add code
Apr 01, 2024
Viaarxiv icon