Mikita Balesni

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Oct 09, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Jul 05, 2024

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

Nov 27, 2023

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

Sep 22, 2023

Taken out of context: On measuring situational awareness in LLMs

Sep 01, 2023