Ali Khalifa Almansoori

Alignment with Preference Optimization Is All You Need for LLM Safety

Sep 12, 2024

Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine Seddik, Mugariya Farooq, Hakim Hacid

Abstract: We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from 57.64% to 99.90%) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over 0.6 to less than 0.07. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
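
The abstract does not spell out how the global safety score is computed. One plausible reading, offered here only as an assumption, is the percentage of model responses that a safety classifier labels as safe. The sketch below illustrates that reading; the function name `safety_score` and the "safe"/"unsafe" verdict strings are hypothetical and are not taken from the paper or from any LlamaGuard API.

```python
from typing import Iterable


def safety_score(verdicts: Iterable[str]) -> float:
    """Return the percentage of verdicts equal to "safe".

    Assumption: each verdict is a string such as "safe" or "unsafe"
    produced by some external safety classifier. This is an
    illustrative metric, not the paper's confirmed definition.
    """
    normalized = [v.strip().lower() for v in verdicts]
    if not normalized:
        raise ValueError("need at least one verdict")
    safe_count = sum(1 for v in normalized if v == "safe")
    return 100.0 * safe_count / len(normalized)


if __name__ == "__main__":
    # Hypothetical verdicts for four responses: three safe, one unsafe.
    print(safety_score(["safe", "unsafe", "safe", "safe"]))
```

Under this reading, a score of 99.90% would mean roughly one response in a thousand is flagged as unsafe by the classifier.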
