
Jacob Andreas

Pitfalls in Evaluating Interpretability Agents
Mar 20, 2026

Do LLMs Benefit From Their Own Words?
Feb 27, 2026

CONCUR: A Framework for Continual Constrained and Unconstrained Routing
Dec 10, 2025

ARC Is a Vision Problem!
Nov 18, 2025

Training Language Models to Explain Their Own Computations
Nov 11, 2025

Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Oct 24, 2025

Modeling Student Learning with 3.8 Million Program Traces
Oct 06, 2025

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Jul 22, 2025

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Jul 03, 2025

Can Gradient Descent Simulate Prompting?
Jun 26, 2025