Tinghao Xie

On Evaluating the Durability of Safeguards for Open-Weight LLMs
Dec 10, 2024

Fantastic Copyrighted Beasts and How (Not) to Generate Them
Jun 20, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Jun 20, 2024

AI Risk Management Should Incorporate Both Safety and Security
May 29, 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Feb 07, 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Oct 05, 2023

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
Aug 23, 2023

Fight Poison with Poison: Detecting Backdoor Poison Samples via Decoupling Benign Correlations
May 26, 2022

Circumventing Backdoor Defenses That Are Based on Latent Separability
May 26, 2022

Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks
Nov 25, 2021