David Krueger

Distributional Training Data Attribution

Jun 15, 2025
Via arXiv

Detecting High-Stakes Interactions with Activation Probes

Jun 12, 2025
Via arXiv

Understanding (Un)Reliability of Steering Vectors in Language Models

May 28, 2025
Via arXiv

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

May 28, 2025
Via arXiv

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025
Via arXiv

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Feb 27, 2025
Via arXiv

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025
Via arXiv

Learning to Forget using Hypernetworks

Dec 01, 2024
Via arXiv

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks

Nov 11, 2024
Via arXiv

Adversarial Robustness of In-Context Learning in Transformers for Linear Regression

Nov 07, 2024
Via arXiv