Javier Rando

Persistent Pre-Training Poisoning of LLMs
Oct 17, 2024

Gradient-based Jailbreak Images for Multimodal Fusion Models
Oct 04, 2024

An Adversarial Perspective on Machine Unlearning for AI Safety
Sep 26, 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Jun 12, 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Apr 22, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Apr 15, 2024

Universal Jailbreak Backdoors from Poisoned Human Feedback
Nov 24, 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Nov 06, 2023

Personas as a Way to Model Truthfulness in Language Models
Oct 30, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Jul 27, 2023