Samuel R. Bowman

Shammie

Alignment faking in large language models

Dec 18, 2024 (via arXiv)

Sabotage Evaluations for Frontier Models

Oct 28, 2024 (via arXiv)

Spontaneous Reward Hacking in Iterative Self-Refinement

Jul 05, 2024 (via arXiv)

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Jun 21, 2024 (via arXiv)

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024 (via arXiv)

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Apr 24, 2024 (via arXiv)

LLM Evaluators Recognize and Favor Their Own Generations

Apr 15, 2024 (via arXiv)

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Mar 08, 2024 (via arXiv)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 15, 2024 (via arXiv)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024 (via arXiv)