Samuel Marks

Erasing Conceptual Knowledge from Language Models

Oct 03, 2024
Via arXiv

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

Aug 02, 2024
Via arXiv

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024
Via arXiv

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Jul 18, 2024
Via arXiv

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Jun 20, 2024
Via arXiv

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Mar 31, 2024
Via arXiv

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Oct 10, 2023
Via arXiv

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023
Via arXiv