Picture for Paul Christiano

Paul Christiano

Towards a Law of Iterated Expectations for Heuristic Estimators

Add code
Oct 02, 2024
Figure 1 for Towards a Law of Iterated Expectations for Heuristic Estimators
Figure 2 for Towards a Law of Iterated Expectations for Heuristic Estimators
Viaarxiv icon

Backdoor defense, learnability and obfuscation

Add code
Sep 04, 2024
Viaarxiv icon

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Add code
Jan 17, 2024
Viaarxiv icon

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Add code
Jan 04, 2024
Figure 1 for Evaluating Language-Model Agents on Realistic Autonomous Tasks
Figure 2 for Evaluating Language-Model Agents on Realistic Autonomous Tasks
Figure 3 for Evaluating Language-Model Agents on Realistic Autonomous Tasks
Figure 4 for Evaluating Language-Model Agents on Realistic Autonomous Tasks
Viaarxiv icon

Model evaluation for extreme risks

Add code
May 24, 2023
Figure 1 for Model evaluation for extreme risks
Figure 2 for Model evaluation for extreme risks
Figure 3 for Model evaluation for extreme risks
Figure 4 for Model evaluation for extreme risks
Viaarxiv icon

Formalizing the presumption of independence

Add code
Nov 12, 2022
Figure 1 for Formalizing the presumption of independence
Figure 2 for Formalizing the presumption of independence
Figure 3 for Formalizing the presumption of independence
Figure 4 for Formalizing the presumption of independence
Viaarxiv icon

Training language models to follow instructions with human feedback

Add code
Mar 04, 2022
Figure 1 for Training language models to follow instructions with human feedback
Figure 2 for Training language models to follow instructions with human feedback
Figure 3 for Training language models to follow instructions with human feedback
Figure 4 for Training language models to follow instructions with human feedback
Viaarxiv icon

Recursively Summarizing Books with Human Feedback

Add code
Sep 27, 2021
Figure 1 for Recursively Summarizing Books with Human Feedback
Figure 2 for Recursively Summarizing Books with Human Feedback
Figure 3 for Recursively Summarizing Books with Human Feedback
Figure 4 for Recursively Summarizing Books with Human Feedback
Viaarxiv icon

Learning to summarize from human feedback

Add code
Sep 02, 2020
Figure 1 for Learning to summarize from human feedback
Figure 2 for Learning to summarize from human feedback
Figure 3 for Learning to summarize from human feedback
Figure 4 for Learning to summarize from human feedback
Viaarxiv icon

Fine-Tuning Language Models from Human Preferences

Add code
Sep 18, 2019
Figure 1 for Fine-Tuning Language Models from Human Preferences
Figure 2 for Fine-Tuning Language Models from Human Preferences
Figure 3 for Fine-Tuning Language Models from Human Preferences
Figure 4 for Fine-Tuning Language Models from Human Preferences
Viaarxiv icon