Maluna Menke

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Feb 12, 2025
Via arXiv