Denis Volk

Demonstrating specification gaming in reasoning models

Feb 18, 2025

Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish

Abstract: We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like o1-preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet hack only after being told that normal play won't work. We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest that reasoning models may resort to hacking to solve difficult problems, as observed in the o1 Docker escape during cyber capabilities testing (OpenAI, 2024).

Hacking CTFs with Plain Agents

Dec 03, 2024

Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk

Abstract: We saturate a high-school-level hacking benchmark with a plain LLM agent design. Concretely, we obtain 95% performance on InterCode-CTF, a popular offensive security benchmark, using prompting, tool use, and multiple attempts. This beats prior work by Phuong et al. (2024), at 29%, and Abramovich et al. (2024), at 72%. Our results suggest that current LLMs have surpassed the high school level in offensive cybersecurity. Their hacking capabilities remain underelicited: our ReAct&Plan prompting strategy solves many challenges in 1-2 turns without complex engineering or advanced harnessing.
