Yuliang Sun

Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack

Jun 17, 2024

Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li

Abstract: Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs' safety on specialized domains, e.g., medicine. However, testing the domain-specific safety of LLMs is challenging due to the lack of domain knowledge-driven attacks in existing benchmarks. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaks from domain knowledge to evaluate the safety of LLMs when applied to those domains. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both relevant to the given knowledge and harmful to the target LLMs. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code: https://github.com/THU-KEG/Knowledge-to-Jailbreak/.

* 18 pages, 14 figures, 11 tables

WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models

Nov 13, 2023

Shangqing Tu, Yuliang Sun, Yushi Bai, Jifan Yu, Lei Hou, Juanzi Li

Abstract: To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate the generation and detection separately, thereby presenting a challenge in unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameter to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output length to form a five-category taxonomy, covering 9 tasks. (3) For evaluation metric, we adopt the GPT4-Judge for automatically evaluating the decline of instruction-following abilities after watermarking. We evaluate 4 open-source watermarks on 2 LLMs under 2 watermarking strengths and observe the common struggles for current methods on maintaining the generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.

* 22 pages, 7 figures
