Yonatan Belinkov

ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Oct 01, 2025
Via arXiv

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Jul 09, 2025
Via arXiv

Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Jun 11, 2025
Via arXiv

SAEs Are Good for Steering -- If You Select the Right Features

May 26, 2025
Via arXiv

Language Models use Lookbacks to Track Beliefs

May 20, 2025
Via arXiv

MIB: A Mechanistic Interpretability Benchmark

Apr 17, 2025
Via arXiv

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Apr 01, 2025
Via arXiv

How Generative IR Retrieves Documents Mechanistically

Mar 25, 2025
Via arXiv

Inside-Out: Hidden Factual Knowledge in LLMs

Mar 19, 2025
Via arXiv

Are formal and functional linguistic mechanisms dissociated?

Mar 14, 2025
Via arXiv