Alex Beutel

Tony

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Mar 11, 2026

OpenAI GPT-5 System Card

Dec 19, 2025

HealthBench: Evaluating Large Language Models Towards Improved Human Health

May 13, 2025

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

Dec 24, 2024

OpenAI o1 System Card

Dec 21, 2024

Deliberative Alignment: Reasoning Enables Safer Language Models

Dec 20, 2024

Rule Based Rewards for Language Model Safety

Nov 02, 2024

GPT-4o System Card

Oct 25, 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Apr 19, 2024

Generalized People Diversity: Learning a Human Perception-Aligned Diversity Representation for People Images

Jan 25, 2024