
David Lindner

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Oct 19, 2023

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Aug 08, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Learning Safety Constraints from Demonstrations with Unknown Rewards

May 25, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability

Jan 12, 2023

Red-Teaming the Stable Diffusion Safety Filter

Oct 11, 2022

Active Exploration for Inverse Reinforcement Learning

Jul 18, 2022