Carson Denison

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Gradient-Based Language Model Red Teaming

Jan 30, 2024
Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Via arXiv

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Jul 25, 2023
Via arXiv

Measuring Faithfulness in Chain-of-Thought Reasoning

Jul 17, 2023
Via arXiv

How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy

Mar 02, 2023
Via arXiv

Private Ad Modeling with DP-SGD

Nov 21, 2022
Via arXiv