Stefan Heimersheim

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Feb 17, 2026
Via arXiv

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Feb 16, 2026
Via arXiv

SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs

Nov 10, 2025
Via arXiv

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

Jul 03, 2025
Via arXiv

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025
Via arXiv

Open Problems in Mechanistic Interpretability

Jan 27, 2025
Via arXiv

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Jan 24, 2025
Via arXiv

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Oct 16, 2024
Via arXiv

Evolution of SAE Features Across Layers in LLMs

Oct 11, 2024
Via arXiv

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024
Via arXiv