Carson Denison

Auditing language models for hidden objectives

Mar 14, 2025
Via arXiv

Alignment faking in large language models

Dec 18, 2024
Via arXiv

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Gradient-Based Language Model Red Teaming

Jan 30, 2024
Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Via arXiv

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Jul 25, 2023
Via arXiv

Measuring Faithfulness in Chain-of-Thought Reasoning

Jul 17, 2023
Via arXiv

How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy

Mar 02, 2023
Via arXiv

Private Ad Modeling with DP-SGD

Nov 21, 2022
Via arXiv