
Sophie Xhonneux

A generative approach to LLM harmfulness detection with special red flag tokens

Feb 22, 2025

Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives

Feb 17, 2025

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Oct 23, 2024

Efficient Adversarial Training in LLMs with Continuous Attacks

May 24, 2024

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Feb 14, 2024

In-Context Learning Can Re-learn Forbidden Tasks

Feb 08, 2024