Nicholas Goldowsky-Dill

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 19, 2025
Via arXiv

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025
Via arXiv

Open Problems in Mechanistic Interpretability

Jan 27, 2025
Via arXiv

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024
Via arXiv

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024
Via arXiv

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

May 17, 2024
Via arXiv

Localizing Model Behavior with Path Patching

Apr 12, 2023
Via arXiv