Oscar Obeso

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Nov 21, 2024

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda

Abstract: Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g. detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.

Refusal in Language Models Is Mediated by a Single Direction

Jun 17, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
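
The two operations named in the abstract, erasing a direction from an activation and adding a direction to it, can be illustrated on a toy vector. The sketch below is not the paper's implementation: the function names, the three-element vectors, and the `scale` parameter are hypothetical, and it assumes that "erasing" means removing the activation's component along the unit-normalized direction.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction` (assumed meaning of 'erasing')."""
    r_hat = normalize(direction)
    coeff = dot(activation, r_hat)
    return [a - coeff * r for a, r in zip(activation, r_hat)]


def add_direction(activation, direction, scale=1.0):
    """Add `scale` times `direction` to `activation` (hypothetical steering step)."""
    return [a + scale * r for a, r in zip(activation, direction)]


# Toy stand-ins for a residual stream activation and a candidate direction.
activation = [3.0, 4.0, 0.0]
direction = [0.0, 2.0, 0.0]

ablated = ablate_direction(activation, direction)
steered = add_direction(activation, direction, scale=1.5)
```

After ablation the activation has no remaining component along the direction, while the parts orthogonal to it are unchanged. In a real model the same arithmetic would be applied to activations at chosen layers and token positions, which the abstract does not specify.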
