Martin Wattenberg

Shared Global and Local Geometry of Language Model Embeddings

Mar 27, 2025

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Feb 18, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

ICLR: In-Context Learning of Representations

Dec 29, 2024

Relational Composition in Neural Networks: A Survey and Call to Action

Jul 19, 2024

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Jun 17, 2024

Designing a Dashboard for Transparency and Control of Conversational AI

Jun 12, 2024

Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Feb 22, 2024

Measuring and Controlling Persona Drift in Language Model Dialogs

Feb 13, 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Jan 03, 2024