Bilal Chughtai

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

Transformer Circuit Faithfulness Metrics are not Robust

Jul 11, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Jul 05, 2024

Can Language Models Explain Their Own Classification Behavior?

May 13, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Feb 11, 2024

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Feb 06, 2023