Dmitrii Volkov

Demonstrating specification gaming in reasoning models

Feb 18, 2025

Resurrecting saturated LLM benchmarks with adversarial encoding

Feb 10, 2025

BadGPT-4o: stripping safety finetuning from GPT models

Dec 06, 2024

Hacking CTFs with Plain Agents

Dec 03, 2024

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

Oct 17, 2024

Badllama 3: removing safety finetuning from Llama 3 in minutes

Jul 01, 2024