Picture for Jared Kaplan

Jared Kaplan

Shammie

Forecasting Rare Language Model Behaviors

Add code
Feb 24, 2025
Viaarxiv icon

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Add code
Jan 31, 2025
Figure 1 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 2 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 3 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 4 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Viaarxiv icon

Clio: Privacy-Preserving Insights into Real-World AI Use

Add code
Dec 18, 2024
Viaarxiv icon

Alignment faking in large language models

Add code
Dec 18, 2024
Viaarxiv icon

Sabotage Evaluations for Frontier Models

Add code
Oct 28, 2024
Figure 1 for Sabotage Evaluations for Frontier Models
Figure 2 for Sabotage Evaluations for Frontier Models
Figure 3 for Sabotage Evaluations for Frontier Models
Figure 4 for Sabotage Evaluations for Frontier Models
Viaarxiv icon

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Add code
Jun 17, 2024
Figure 1 for Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Figure 2 for Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Figure 3 for Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Figure 4 for Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Viaarxiv icon

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Add code
Jan 17, 2024
Viaarxiv icon

Evaluating and Mitigating Discrimination in Language Model Decisions

Add code
Dec 06, 2023
Viaarxiv icon

Specific versus General Principles for Constitutional AI

Add code
Oct 20, 2023
Figure 1 for Specific versus General Principles for Constitutional AI
Figure 2 for Specific versus General Principles for Constitutional AI
Figure 3 for Specific versus General Principles for Constitutional AI
Figure 4 for Specific versus General Principles for Constitutional AI
Viaarxiv icon

Studying Large Language Model Generalization with Influence Functions

Add code
Aug 07, 2023
Figure 1 for Studying Large Language Model Generalization with Influence Functions
Figure 2 for Studying Large Language Model Generalization with Influence Functions
Figure 3 for Studying Large Language Model Generalization with Influence Functions
Figure 4 for Studying Large Language Model Generalization with Influence Functions
Viaarxiv icon