
Michael Sellitto

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Jan 17, 2024

The Capacity for Moral Self-Correction in Large Language Models
Feb 18, 2023

Discovering Language Model Behaviors with Model-Written Evaluations
Dec 19, 2022

Constitutional AI: Harmlessness from AI Feedback
Dec 15, 2022

The AI Index 2022 Annual Report
May 02, 2022

The AI Index 2021 Annual Report
Mar 09, 2021