Jan Leike

Tony

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

GPT-4o System Card

Oct 25, 2024

Prover-Verifier Games improve legibility of LLM outputs

Jul 18, 2024

LLM Critics Help Catch LLM Bugs

Jun 28, 2024

Scaling and evaluating sparse autoencoders

Jun 06, 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Dec 14, 2023

Let's Verify Step by Step

May 31, 2023

Self-critiquing models for assisting human evaluators

Jun 14, 2022

Training language models to follow instructions with human feedback

Mar 04, 2022

Safe Deep RL in 3D Environments using Human Feedback

Jan 21, 2022