Buck Shlegeris

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

Sabotage Evaluations for Frontier Models

Oct 28, 2024

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Sep 12, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

AI Control: Improving Safety Despite Intentional Subversion

Dec 14, 2023

Generalized Wick Decompositions

Oct 10, 2023

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023

Language models are better than humans at next-token prediction

Dec 21, 2022

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Nov 01, 2022