Francesca Carlon

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Feb 12, 2025
Via arXiv