Sam Ringer

The Capacity for Moral Self-Correction in Large Language Models

Feb 18, 2023

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez (+39 more)

Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 19, 2022

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath (+53 more)

Abstract: As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

* For associated data visualizations, see https://www.evals.anthropic.com/model-written/; for full datasets, see https://github.com/anthropics/evals

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon (+41 more)

Abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Language Models (Mostly) Know What They Know

Jul 16, 2022

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson (+26 more)

Abstract: We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

* 23+17 pages; refs added, typos fixed
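
To make "well-calibrated" in the abstract above concrete, here is a minimal sketch of one common calibration measure, expected calibration error (ECE). It is an illustration only: the function name, the equal-width binning, and the toy inputs are assumptions, and nothing here is taken from the paper's own evaluation code.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Illustrative ECE over equal-width confidence bins (assumed setup).

    probs  -- predicted probabilities in [0, 1], e.g. a model's P(True)
    labels -- 1 if the proposed answer was actually correct, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)  # mean predicted probability
        accuracy = sum(y for _, y in b) / len(b)  # fraction actually correct
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near zero means that, within each bin, the stated confidence roughly matches the observed fraction of correct answers; a model that is always fully confident and always wrong scores 1.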

Hierarchical Quantized Autoencoders

Feb 19, 2020

Will Williams, Sam Ringer, Tom Ash, John Hughes, David MacLeod, Jamie Dougherty

Abstract: Despite progress in training neural networks for lossy image compression, current approaches fail to maintain both perceptual quality and high-level features at very low bitrates. Encouraged by recent success in learning discrete representations with Vector Quantized Variational AutoEncoders (VQ-VAEs), we motivate the use of a hierarchy of VQ-VAEs to attain high factors of compression. We show that the combination of quantization and hierarchical latent structure aids likelihood-based image compression. This leads us to introduce a more probabilistic framing of the VQ-VAE, of which previous work is a limiting case. Our hierarchy produces a Markovian series of latent variables that reconstruct high-quality images which retain semantically meaningful features. These latents can then be further used to generate realistic samples. We provide qualitative and quantitative evaluations of reconstructions and samples on the CelebA and MNIST datasets.
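
For readers unfamiliar with the quantization step that VQ-VAEs build on, the sketch below shows the standard deterministic nearest-codebook lookup. It is an assumption-laden illustration, not the paper's method: the abstract describes a more probabilistic framing and a hierarchy of such layers, neither of which is shown here, and the function names and toy codebook are hypothetical.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry.

    Uses squared Euclidean distance, as in the standard (deterministic)
    VQ-VAE lookup. The index sequence is the compressed representation.
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Recover the (lossy) vectors by looking the indices back up."""
    return [codebook[i] for i in indices]


# Toy example: a 2-entry codebook of 2-D vectors (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
codes = quantize([[0.1, -0.2], [0.9, 1.2]], codebook)
reconstruction = dequantize(codes, codebook)
```

Storing only the integer codes is what yields compression: each latent vector costs log2 of the codebook size in bits, regardless of its dimensionality.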

Texture Bias Of CNNs Limits Few-Shot Classification Performance

Oct 18, 2019

Sam Ringer, Will Williams, Tom Ash, Remi Francis, David MacLeod

Abstract: Accurate image classification given small amounts of labelled data (few-shot classification) remains an open problem in computer vision. In this work we examine how the known texture bias of Convolutional Neural Networks (CNNs) affects few-shot classification performance. Although texture bias can help in standard image classification, in this work we show it significantly harms few-shot classification performance. After correcting this bias we demonstrate state-of-the-art performance on the competitive miniImageNet task using a method far simpler than the current best performing few-shot learning approaches.
