Samuel R. Bowman

Shammie

Sabotage Evaluations for Frontier Models

Oct 28, 2024

Spontaneous Reward Hacking in Iterative Self-Refinement

Jul 05, 2024

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Jun 21, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Apr 24, 2024

LLM Evaluators Recognize and Favor Their Own Generations

Apr 15, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Mar 08, 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 15, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Nov 20, 2023