Adrià Garriga-Alonso

Shammie

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

Mar 02, 2026
Via arXiv

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

Feb 16, 2026
Via arXiv

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Feb 11, 2026
Via arXiv

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Jun 11, 2025
Via arXiv

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

May 16, 2025
Via arXiv

Among Us: A Sandbox for Agentic Deception

Apr 05, 2025
Via arXiv

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025
Via arXiv

Hypothesis Testing the Circuit Hypothesis in LLMs

Oct 16, 2024
Via arXiv

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024
Via arXiv

Adversarial Circuit Evaluation

Jul 21, 2024
Via arXiv