Yidong Ding

MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation

Jan 06, 2025

Yidong Ding, Jiafei Niu, Ping Yi

Abstract: In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, which are often introduced by downloading or fine-tuning on poisoned datasets. Many current methods for mitigating backdoors in NLP models rely on the pre-trained (not yet fine-tuned) weights and fail when those weights are unavailable. In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting. It then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student. Experimental results demonstrate that MBTSAD achieves backdoor mitigation performance comparable to that of methods based on pre-trained weights while maintaining performance on clean data. Because MBTSAD does not rely on pre-trained weights, it is more useful in scenarios where they are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations, finding that the token splitting method in MBTSAD's first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.

* Accepted by ICTAI 2024
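
The two MBTSAD steps named in the abstract above, token splitting and attention distillation, can be illustrated with a minimal sketch. The abstract does not say how tokens are split or which distillation loss is used, so everything here is an assumption made for illustration: the helper names `split_token`, `token_split_sentence`, and `attention_distillation_loss` are hypothetical, the split is a random cut inside a whitespace-delimited word, and the loss is a plain mean squared error between teacher and student attention maps. It is not the authors' implementation.

```python
import random


def split_token(word, rng):
    """Cut a word into two non-empty pieces at a random interior position.

    Assumption: the paper's actual splitting rule is not given in the abstract.
    """
    if len(word) < 2:
        return [word]
    cut = rng.randint(1, len(word) - 1)  # randint includes both endpoints
    return [word[:cut], word[cut:]]


def token_split_sentence(sentence, split_prob=0.5, seed=0):
    """Return a copy of the sentence in which some words are split in two."""
    rng = random.Random(seed)
    pieces = []
    for word in sentence.split():
        if rng.random() < split_prob:
            pieces.extend(split_token(word, rng))
        else:
            pieces.append(word)
    return " ".join(pieces)


def attention_distillation_loss(teacher_attn, student_attn):
    """Mean squared error between two attention maps given as lists of rows.

    Assumption: MSE stands in for whatever distance the paper uses. Per the
    abstract, the teacher is the retrained model and the student is the
    original backdoored model.
    """
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_attn, student_attn):
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    clean = "the movie was surprisingly good"
    print(token_split_sentence(clean, split_prob=0.5, seed=0))
    teacher = [[0.7, 0.3], [0.4, 0.6]]  # hypothetical attention rows
    student = [[0.9, 0.1], [0.5, 0.5]]
    print(attention_distillation_loss(teacher, student))
```

In a real pipeline the split sentences would be re-tokenized by the model's own tokenizer before retraining, and the loss would be computed on attention tensors taken from matching layers of the two models; both details are left out of this sketch.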

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

May 22, 2024

Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu

Abstract: Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP). Backdoor attacks initially verified that LLMs can cause substantial harm at all stages, but their cost and robustness have been criticized. Attacking LLMs directly is inherently risky under security review and prohibitively expensive. Moreover, the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack in Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, constraining the triggering conditions to a parameter subspace to improve matching. To improve the recall of the RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both attackers' and users' perspectives, and further verify whether the context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG poses versatile threats while maintaining retrieval capabilities on normal queries.

* 18 pages, 13 figures, 4 tables
