Charlie Griffin

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Dec 15, 2025
Via arXiv

Practical challenges of control monitoring in frontier AI deployments

Dec 15, 2025
Via arXiv

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols

Dec 17, 2024
Via arXiv

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Sep 12, 2024
Via arXiv

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

Nov 07, 2023
Via arXiv

On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning

Oct 18, 2023
Via arXiv

Goodhart's Law in Reinforcement Learning

Oct 13, 2023
Via arXiv

Lexicographic Multi-Objective Reinforcement Learning

Dec 28, 2022
Via arXiv

Privacy-preserving Object Detection

Mar 11, 2021
Via arXiv