Picture for Tom Henighan

Tom Henighan

Specific versus General Principles for Constitutional AI

Add code
Oct 20, 2023
Figure 1 for Specific versus General Principles for Constitutional AI
Figure 2 for Specific versus General Principles for Constitutional AI
Figure 3 for Specific versus General Principles for Constitutional AI
Figure 4 for Specific versus General Principles for Constitutional AI
Viaarxiv icon

The Capacity for Moral Self-Correction in Large Language Models

Add code
Feb 18, 2023
Figure 1 for The Capacity for Moral Self-Correction in Large Language Models
Figure 2 for The Capacity for Moral Self-Correction in Large Language Models
Figure 3 for The Capacity for Moral Self-Correction in Large Language Models
Figure 4 for The Capacity for Moral Self-Correction in Large Language Models
Viaarxiv icon

Discovering Language Model Behaviors with Model-Written Evaluations

Add code
Dec 19, 2022
Viaarxiv icon

Constitutional AI: Harmlessness from AI Feedback

Add code
Dec 15, 2022
Figure 1 for Constitutional AI: Harmlessness from AI Feedback
Figure 2 for Constitutional AI: Harmlessness from AI Feedback
Figure 3 for Constitutional AI: Harmlessness from AI Feedback
Figure 4 for Constitutional AI: Harmlessness from AI Feedback
Viaarxiv icon

Measuring Progress on Scalable Oversight for Large Language Models

Add code
Nov 11, 2022
Figure 1 for Measuring Progress on Scalable Oversight for Large Language Models
Figure 2 for Measuring Progress on Scalable Oversight for Large Language Models
Figure 3 for Measuring Progress on Scalable Oversight for Large Language Models
Viaarxiv icon

In-context Learning and Induction Heads

Add code
Sep 24, 2022
Viaarxiv icon

Toy Models of Superposition

Add code
Sep 21, 2022
Viaarxiv icon

Language Models (Mostly) Know What They Know

Add code
Jul 16, 2022
Figure 1 for Language Models (Mostly) Know What They Know
Figure 2 for Language Models (Mostly) Know What They Know
Figure 3 for Language Models (Mostly) Know What They Know
Figure 4 for Language Models (Mostly) Know What They Know
Viaarxiv icon

Scaling Laws and Interpretability of Learning from Repeated Data

Add code
May 21, 2022
Figure 1 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 2 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 3 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 4 for Scaling Laws and Interpretability of Learning from Repeated Data
Viaarxiv icon

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Add code
Apr 12, 2022
Figure 1 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 2 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 3 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 4 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Viaarxiv icon