Title: Course-Correction: Safety Alignment Using Synthetic Preferences

Jul 23, 2024

Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

Abstract: The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform course-correction, i.e., a model's ability to steer away from generating harmful content autonomously. To start with, we introduce the C²-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C²-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.

* Dataset and script will be available at https://github.com/pillowsofwind/Course-Correction
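
As a rough illustration of what learning from pairwise preferences means in general, here is a minimal sketch of a Bradley-Terry style preference loss. This is not the authors' training objective or the C²-Syn data format, neither of which is specified in the abstract. The `PreferencePair` schema, its field names, and the scalar scores are all hypothetical, chosen only to show how a pair in which one response is preferred (for example, because it course-corrects earlier) yields a training signal.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One pairwise preference record (hypothetical schema, not the C²-Syn format)."""
    prompt: str
    chosen: str    # response assumed to be preferred, e.g. corrects earlier
    rejected: str  # response assumed to be dispreferred, e.g. corrects later


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The scores stand in for whatever scalar a model assigns to each response;
    how they are produced is outside the scope of this sketch.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    pair = PreferencePair(
        prompt="<prompt>",
        chosen="<response that corrects early>",
        rejected="<response that corrects late>",
    )
    # The loss shrinks as the chosen response is scored further above the rejected one.
    print(pair.prompt, pairwise_preference_loss(2.0, 0.0))
    print(pair.prompt, pairwise_preference_loss(0.0, 2.0))
```

The loss is small when the preferred response already scores higher and grows roughly linearly when the ordering is reversed, which is what pushes a model toward the preferred behavior during fine-tuning.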
