Nina Rimsky

Refusal in Language Models Is Mediated by a Single Direction

Jun 17, 2024
Via arXiv

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

Jun 13, 2024
Via arXiv

Investigating Bias Representations in Llama 2 Chat via Activation Steering

Feb 01, 2024
Via arXiv

Steering Llama 2 via Contrastive Activation Addition

Dec 09, 2023
Via arXiv