
Buck Shlegeris

Alignment faking in large language models (Dec 18, 2024)

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols (Dec 17, 2024)

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats (Nov 26, 2024)

Towards evaluations-based safety cases for AI scheming (Nov 07, 2024)

Sabotage Evaluations for Frontier Models (Oct 28, 2024)

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (Sep 12, 2024)

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (Jun 17, 2024)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Jan 17, 2024)

AI Control: Improving Safety Despite Intentional Subversion (Dec 14, 2023)

Generalized Wick Decompositions (Oct 10, 2023)