Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dmitrii Volkov

Demonstrating specification gaming in reasoning models

Feb 18, 2025

Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish

Abstract:We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1 preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack. We improve upon prior work like (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.

Via

Access Paper or Ask Questions

Resurrecting saturated LLM benchmarks with adversarial encoding

Feb 10, 2025

Igor Ivanov, Dmitrii Volkov

Figure 1 for Resurrecting saturated LLM benchmarks with adversarial encoding

Figure 2 for Resurrecting saturated LLM benchmarks with adversarial encoding

Figure 3 for Resurrecting saturated LLM benchmarks with adversarial encoding

Figure 4 for Resurrecting saturated LLM benchmarks with adversarial encoding

Abstract:Recent work showed that small changes in benchmark questions can reduce LLMs' reasoning and recall. We explore two such changes: pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models, these predictably reduce performance, essentially heightening the performance ceiling of a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.

Via

Access Paper or Ask Questions

BadGPT-4o: stripping safety finetuning from GPT models

Dec 06, 2024

Ekaterina Krupkina, Dmitrii Volkov

Abstract:We show a version of Qi et al. 2023's simple fine-tuning poisoning technique strips GPT-4o's safety guardrails without degrading the model. The BadGPT attack matches best white-box jailbreaks on HarmBench and StrongREJECT. It suffers no token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.

Via

Access Paper or Ask Questions

Hacking CTFs with Plain Agents

Dec 03, 2024

Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk

Figure 1 for Hacking CTFs with Plain Agents

Figure 2 for Hacking CTFs with Plain Agents

Figure 3 for Hacking CTFs with Plain Agents

Figure 4 for Hacking CTFs with Plain Agents

Abstract:We saturate a high-school-level hacking benchmark with plain LLM agent design. Concretely, we obtain 95% performance on InterCode-CTF, a popular offensive security benchmark, using prompting, tool use, and multiple attempts. This beats prior work by Phuong et al. 2024 (29%) and Abramovich et al. 2024 (72%). Our results suggest that current LLMs have surpassed the high school level in offensive cybersecurity. Their hacking capabilities remain underelicited: our ReAct&Plan prompting strategy solves many challenges in 1-2 turns without complex engineering or advanced harnessing.

Via

Access Paper or Ask Questions

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

Oct 17, 2024

Reworr, Dmitrii Volkov

Abstract:We introduce the LLM Honeypot, a system for monitoring autonomous AI hacking agents. We deployed a customized SSH honeypot and applied prompt injections with temporal analysis to identify LLM-based agents among attackers. Over a trial run of a few weeks in a public environment, we collected 800,000 hacking attempts and 6 potential AI agents, which we plan to analyze in depth in future work. Our objectives aim to improve awareness of AI hacking agents and enhance preparedness for their risks.

Via

Access Paper or Ask Questions

Badllama 3: removing safety finetuning from Llama 3 in minutes

Jul 01, 2024

Dmitrii Volkov

Abstract:We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods-QLoRA, ReFT, and Ortho-and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Via

Access Paper or Ask Questions