Diogo Schwerz de Lucena

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

Dec 20, 2024
Via arXiv

Unexpected Benefits of Self-Modeling in Neural Systems

Jul 14, 2024
Via arXiv

Rethinking harmless refusals when fine-tuning foundation models

Jun 27, 2024
Via arXiv