Xiangyu Qi

On Evaluating the Durability of Safeguards for Open-Weight LLMs

Dec 10, 2024

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Jun 25, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Jun 20, 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Jun 10, 2024

AI Risk Management Should Incorporate Both Safety and Security

May 29, 2024

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

Feb 27, 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Feb 07, 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Oct 05, 2023

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Aug 23, 2023

Visual Adversarial Examples Jailbreak Large Language Models

Jun 22, 2023