Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vikrant Varma

An Approach to Technical AGI Safety and Security

Apr 02, 2025

Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa(+20 more)

Abstract:Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.

Via

Access Paper or Ask Questions

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Aug 09, 2024

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

Figure 1 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Figure 2 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Figure 3 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Figure 4 for Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Abstract:Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope

* 12 main text pages, and 14 pages of acknowledgements, references and appendices

Via

Access Paper or Ask Questions

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Jul 19, 2024

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

Figure 1 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Figure 2 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Figure 3 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Figure 4 for Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Abstract:Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.

Via

Access Paper or Ask Questions

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Figure 1 for Improving Dictionary Learning with Gated Sparse Autoencoders

Figure 2 for Improving Dictionary Learning with Gated Sparse Autoencoders

Figure 3 for Improving Dictionary Learning with Gated Sparse Autoencoders

Figure 4 for Improving Dictionary Learning with Gated Sparse Autoencoders

Abstract:Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

* 15 main text pages, 22 appendix pages

Via

Access Paper or Ask Questions

Challenges with unsupervised LLM knowledge discovery

Dec 18, 2023

Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

Figure 1 for Challenges with unsupervised LLM knowledge discovery

Figure 2 for Challenges with unsupervised LLM knowledge discovery

Figure 3 for Challenges with unsupervised LLM knowledge discovery

Figure 4 for Challenges with unsupervised LLM knowledge discovery

Abstract:We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al. - arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply to evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character's, will persist for future unsupervised methods.

* 12 pages (38 including references and appendices). First three authors equal contribution, randomised order

Via

Access Paper or Ask Questions

Explaining grokking through circuit efficiency

Sep 05, 2023

Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

Figure 1 for Explaining grokking through circuit efficiency

Figure 2 for Explaining grokking through circuit efficiency

Figure 3 for Explaining grokking through circuit efficiency

Figure 4 for Explaining grokking through circuit efficiency

Abstract:One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.

Via

Access Paper or Ask Questions

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Oct 04, 2022

Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

Figure 1 for Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Figure 2 for Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Figure 3 for Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Figure 4 for Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Abstract:The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgeneralization. Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations. We demonstrate that goal misgeneralization can occur in practical systems by providing several examples in deep learning systems across a variety of domains. Extrapolating forward to more capable systems, we provide hypotheticals that illustrate how goal misgeneralization could lead to catastrophic risk. We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.

Via

Access Paper or Ask Questions

Safe Deep RL in 3D Environments using Human Feedback

Jan 21, 2022

Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike

Figure 1 for Safe Deep RL in 3D Environments using Human Feedback

Figure 2 for Safe Deep RL in 3D Environments using Human Feedback

Figure 3 for Safe Deep RL in 3D Environments using Human Feedback

Figure 4 for Safe Deep RL in 3D Environments using Human Feedback

Abstract:Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is yet unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans - whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.

Via

Access Paper or Ask Questions

Imitating Interactive Intelligence

Jan 21, 2021

Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik(+19 more)

Figure 1 for Imitating Interactive Intelligence

Figure 2 for Imitating Interactive Intelligence

Figure 3 for Imitating Interactive Intelligence

Figure 4 for Imitating Interactive Intelligence

Abstract:A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalise beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount.

Via

Access Paper or Ask Questions