Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cody Rushing

Basic Legibility Protocols Improve Trusted Monitoring

Feb 09, 2026

Ashwin Sreevatsa, Sebastian Prasanna, Cody Rushing

Abstract:The AI Control research agenda aims to develop control protocols: safety techniques that prevent untrusted AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger, untrusted models$\unicode{x2013}$but this often fails when the untrusted model's actions exceed the monitor's comprehension. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for a monitor to evaluate. We perform control evaluations in the APPS coding setting, where an adversarial agent attempts to write backdoored code without detection. We study legibility protocols that allow the untrusted model to thoroughly document its code with comments$\unicode{x2013}$in contrast to prior work, which removed comments to prevent deceptive ones. We find that: (i) commenting protocols improve safety without sacrificing task performance relative to comment-removal baselines; (ii) commenting disproportionately benefits honest code, which typically has a natural explanation that resolves monitor suspicion, whereas backdoored code frequently lacks an easy justification; (iii) gains from commenting increase with monitor strength, as stronger monitors better distinguish genuine justifications from only superficially plausible ones.

Via

Access Paper or Ask Questions

BashArena: A Control Setting for Highly Privileged AI Agents

Dec 17, 2025

Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, Aryan Bhatt

Figure 1 for BashArena: A Control Setting for Highly Privileged AI Agents

Figure 2 for BashArena: A Control Setting for Highly Privileged AI Agents

Figure 3 for BashArena: A Control Setting for Highly Privileged AI Agents

Figure 4 for BashArena: A Control Setting for Highly Privileged AI Agents

Abstract:Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.

* The task generation pipeline can be found here: https://github.com/redwoodresearch/basharena_public

Via

Access Paper or Ask Questions

Ctrl-Z: Controlling AI Agents via Resampling

Apr 14, 2025

Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris

Figure 1 for Ctrl-Z: Controlling AI Agents via Resampling

Figure 2 for Ctrl-Z: Controlling AI Agents via Resampling

Figure 3 for Ctrl-Z: Controlling AI Agents via Resampling

Figure 4 for Ctrl-Z: Controlling AI Agents via Resampling

Abstract:Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.

* bashcontrol.com

Via

Access Paper or Ask Questions

Explorations of Self-Repair in Language Models

Feb 23, 2024

Cody Rushing, Neel Nanda

Abstract:Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomena where if components in large language models are ablated, later components will change their behavior to compensate. Our work builds off this past literature, demonstrating that self-repair exists on a variety of models families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two different mechanisms that contribute to self-repair, including changes in the final LayerNorm scaling factor (which can repair up to 30% of the direct effect) and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners and close with a more speculative discussion on the mystery of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.

Via

Access Paper or Ask Questions

Copy Suppression: Comprehensively Understanding an Attention Head

Oct 06, 2023

Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

Figure 1 for Copy Suppression: Comprehensively Understanding an Attention Head

Figure 2 for Copy Suppression: Comprehensively Understanding an Attention Head

Figure 3 for Copy Suppression: Comprehensively Understanding an Attention Head

Figure 4 for Copy Suppression: Comprehensively Understanding an Attention Head

Abstract:We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior which improves overall model calibration. This explains why multiple prior works studying certain narrow tasks found negative heads that systematically favored the wrong answer. We uncover the mechanism that the Negative Heads use for copy suppression with weights-based evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small. To the best of our knowledge, this is the most comprehensive description of the complete role of a component in a language model to date. One major effect of copy suppression is its role in self-repair. Self-repair refers to how ablating crucial model components results in downstream neural network parts compensating for this ablation. Copy suppression leads to self-repair: if an initial overconfident copier is ablated, then there is nothing to suppress. We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task. Interactive visualisations of the copy suppression phenomena may be seen at our web app https://copy-suppression.streamlit.app/

Via

Access Paper or Ask Questions