Title: PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Oct 11, 2024

Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

Abstract: Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

* Tingchen Fu and Fazl Barez are core research contributors
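
Finding (2) in the abstract can be made concrete with a small sketch. The abstract does not give the exact functional form, so the code below assumes one common reading of "log-linear": the attack effect grows linearly in the logarithm of the poison ratio, effect ≈ a + b·ln(ratio). The function name `fit_log_linear` and all numbers are hypothetical and synthetic; none of them are results from the paper.

```python
import math

def fit_log_linear(ratios, effects):
    """Ordinary least-squares fit of effect = a + b * ln(ratio).

    Assumed functional form; the abstract only says "log-linear".
    """
    xs = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(effects) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, effects))
    b = sxy / sxx              # slope: change in effect per unit of ln(ratio)
    a = mean_y - b * mean_x    # intercept: fitted effect at ratio = 1
    return a, b

# Synthetic data generated from effect = 0.9 + 0.1 * ln(ratio).
# These values are made up for illustration, NOT measurements from PoisonBench.
ratios = [0.01, 0.03, 0.05, 0.10]
effects = [0.9 + 0.1 * math.log(r) for r in ratios]

a, b = fit_log_linear(ratios, effects)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

Under this assumed form, multiplying the poison ratio by a constant factor adds a constant amount (b times the log of that factor) to the attack effect, which is why even small ratios can matter.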
