Xiangyu Qi

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Jun 25, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Jun 20, 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Jun 10, 2024

AI Risk Management Should Incorporate Both Safety and Security

May 29, 2024

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

Feb 27, 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Feb 07, 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Oct 05, 2023

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Aug 23, 2023

Visual Adversarial Examples Jailbreak Large Language Models

Jun 22, 2023

Uncovering Adversarial Risks of Test-Time Adaptation

Feb 04, 2023