Monte MacDiarmid

Auditing language models for hidden objectives

Mar 14, 2025
Via arXiv

Alignment faking in large language models

Dec 18, 2024
Via arXiv

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Via arXiv

Understanding and Controlling a Maze-Solving Policy Network

Oct 12, 2023
Via arXiv

Activation Addition: Steering Language Models Without Optimization

Sep 01, 2023
Via arXiv