David Majercak

Steering Language Model Refusal with Sparse Autoencoders

Nov 18, 2024

Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde

Abstract: Responsible practices for deploying language models include guiding models to recognize and refuse to answer prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering's adverse effects on performance.
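
To make the idea of inference-time feature steering concrete, here is a minimal sketch in plain Python. It assumes one common formulation: a sparse autoencoder encodes an activation vector with a ReLU layer, and steering adds a feature's decoder direction to the activation so that the feature moves toward a chosen clamp value. The abstract does not state which layer, feature, or clamp value the authors use, and every name here (`sae_encode`, `steer`, `w_enc`, `w_dec`, `clamp_value`) is hypothetical rather than taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per SAE feature


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]


def steer(x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix,
          feature: int, clamp_value: float) -> Vector:
    """Shift activation x along one feature's decoder direction.

    w_dec[feature] is assumed to be that feature's decoder direction.
    The shift equals (clamp_value - current feature activation), so the
    rest of x, including anything the SAE fails to reconstruct, is kept.
    """
    f = sae_encode(x, w_enc, b_enc)
    delta = clamp_value - f[feature]
    return [xi + delta * di for xi, di in zip(x, w_dec[feature])]


# Toy example: 2-dimensional activation, 2 features, identity weights.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

steered = steer([0.5, 0.25], W_ENC, B_ENC, W_DEC, feature=1, clamp_value=2.0)
print(steered)
```

In a real model the edited vector would replace the residual-stream activation at the chosen layer on every forward pass, with no change to the weights. The abstract's caveat about degraded benchmark performance applies to this kind of intervention, since the shift is applied regardless of the prompt.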

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Jul 18, 2024

Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang (+20 more)

Abstract: Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks.
