
Dylan Hadfield-Menell

Surgical Activation Steering via Generative Causal Mediation

Feb 17, 2026

Open-Universe Assistance Games

Aug 20, 2025

Layered Unlearning for Adversarial Relearning

May 14, 2025

The AI Agent Index

Feb 03, 2025

Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in Large Language Models

Jan 15, 2025

Goal Inference from Open-Ended Dialog

Oct 17, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Jul 22, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Apr 03, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Mar 08, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024