Dmitrii Krasheninnikov

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks

Nov 11, 2024

Stress-Testing Capability Elicitation With Password-Locked Models

May 29, 2024

Meta- (out-of-context) learning in neural networks

Oct 24, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Defining and Characterizing Reward Hacking

Sep 27, 2022

Combining Reward Information from Multiple Sources

Mar 22, 2021

Preferences Implicit in the State of the World

Feb 12, 2019