
David Lindner

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

Dec 05, 2024

ViSTa Dataset: Do vision-language models understand sequential tasks?

Nov 21, 2024

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Nov 18, 2024

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Oct 19, 2023

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Aug 08, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Learning Safety Constraints from Demonstrations with Unknown Rewards

May 25, 2023