David Duvenaud

Sabotage Evaluations for Frontier Models

Oct 28, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

May 21, 2024

Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs

Feb 13, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Sorting Out Quantum Monte Carlo

Nov 09, 2023

Towards Understanding Sycophancy in Language Models

Oct 27, 2023

Tools for Verifying Neural Models' Training Data

Jul 02, 2023

On Implicit Bias in Overparameterized Bilevel Optimization

Dec 28, 2022

Meta-Learning to Improve Pre-Training

Nov 02, 2021