Ollie Jaffe

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 12, 2024
Via arXiv