
Lawrence Chan

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Dec 04, 2024

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Nov 22, 2024

Mathematical Models of Computation in Superposition

Aug 10, 2024

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024

Provable Guarantees for Model Performance via Mechanistic Interpretability

Jun 18, 2024

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Jan 04, 2024

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Feb 06, 2023

Progress measures for grokking via mechanistic interpretability

Jan 13, 2023

Language models are better than humans at next-token prediction

Dec 21, 2022

Adversarial Training for High-Stakes Reliability

May 04, 2022