Severin Field

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nov 02, 2024
Via arXiv

Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Oct 03, 2024
Via arXiv

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

May 11, 2024
Via arXiv