Samuel Marks

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Nov 28, 2024
Via arXiv

Erasing Conceptual Knowledge from Language Models

Oct 03, 2024
Via arXiv

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

Aug 02, 2024
Via arXiv

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024
Via arXiv

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Jul 18, 2024
Via arXiv

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Jun 20, 2024
Via arXiv

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Mar 31, 2024
Via arXiv

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Oct 10, 2023
Via arXiv

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023
Via arXiv