Jérémy Scheurer

Frontier Models are Capable of In-context Scheming

Dec 06, 2024

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

Analyzing Probabilistic Methods for Evaluating Agent Capabilities

Sep 24, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

Nov 27, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Training Language Models with Language Feedback at Scale

Apr 09, 2023

Improving Code Generation by Training with Natural Language Feedback

Mar 28, 2023

Few-shot Adaptation Works with UnpredicTable Data

Aug 08, 2022

Training Language Models with Natural Language Feedback

May 02, 2022