Lucy Farnik

Inducing Human-like Biases in Moral Reasoning Language Models

Nov 23, 2024
Via arXiv

Residual Stream Analysis with Multi-Layer SAEs

Sep 06, 2024
Via arXiv

STARC: A General Framework For Quantifying Differences Between Reward Functions

Sep 26, 2023
Via arXiv