Title: Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Aug 27, 2024

Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi

Abstract: Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective over both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function captures a global preference ranking that balances safety and helpfulness. To evaluate BFPO, we develop a benchmark comprising comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that rely heavily on human labor, using less than 10% of the computational resources. The training recipes and models will be released.
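
To make the idea of a "single supervised learning objective" driven by a labeling function more concrete, here is a minimal Python sketch. It is not the BFPO objective from the paper, which the abstract does not spell out. It is a generic preference-regression illustration under stated assumptions: the function names, the `base` and `safety_weight` parameters, and the squared-error form are all hypothetical choices made for this example.

```python
def target_margin(chosen_is_safe: bool, rejected_is_safe: bool,
                  base: float = 0.5, safety_weight: float = 1.0) -> float:
    """Hypothetical labeling function (an assumption, not the paper's).

    The helpfulness preference contributes a positive base margin for the
    "chosen" response; the safety labels shift that margin, so an unsafe
    "chosen" response can end up ranked below a safe "rejected" one.
    """
    return base + safety_weight * (float(chosen_is_safe) - float(rejected_is_safe))


def log_ratio_margin(logp_chosen: float, logp_rejected: float,
                     ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """Policy-vs-reference log-ratio gap between the two responses."""
    return (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)


def supervised_preference_loss(margin: float, target: float, tau: float = 1.0) -> float:
    """Squared error between the model's margin and the scaled label target.

    Assumed form for illustration; tau is a hypothetical temperature.
    """
    return (margin - target / tau) ** 2


# Usage: one preference pair where the helpful response is also the safe one.
margin = log_ratio_margin(-1.0, -3.0, -2.0, -2.5)
loss = supervised_preference_loss(margin, target_margin(True, False))
```

The point of the sketch is only that training needs no reward model or RL loop: each preference pair, together with its safety labels, yields a scalar target, and the policy is fit to that target by ordinary supervised regression.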
