Bilal Chughtai

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024
Via arXiv

Transformer Circuit Faithfulness Metrics are not Robust

Jul 11, 2024
Via arXiv

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Jul 05, 2024
Via arXiv

Can Language Models Explain Their Own Classification Behavior?

May 13, 2024
Via arXiv

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Feb 11, 2024
Via arXiv

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Feb 06, 2023
Via arXiv