Adrià Garriga-Alonso

Shammie

Among Us: A Sandbox for Agentic Deception

Apr 05, 2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025

Hypothesis Testing the Circuit Hypothesis in LLMs

Oct 16, 2024

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024

Adversarial Circuit Evaluation

Jul 21, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Jul 19, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024

Investigating the Indirect Object Identification circuit in Mamba

Jul 19, 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability

Apr 28, 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022