Lucy Farnik

Inducing Human-like Biases in Moral Reasoning Language Models

Nov 23, 2024
Via arXiv

Residual Stream Analysis with Multi-Layer SAEs

Sep 06, 2024
Via arXiv

STARC: A General Framework For Quantifying Differences Between Reward Functions

Sep 26, 2023
Via arXiv