Sören Mindermann

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025

Alignment faking in large language models

Dec 18, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023

Specific versus General Principles for Constitutional AI

Oct 20, 2023

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 26, 2023

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

Jun 16, 2022

Prioritized training on points that are learnable, worth learning, and not yet learned

Jul 06, 2021

Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

Mar 08, 2021

On the robustness of effectiveness estimation of nonpharmaceutical interventions against COVID-19 transmission

Jul 27, 2020