Joar Skalse

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

Jun 22, 2024
Via arXiv

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

May 10, 2024
Via arXiv

Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification

Mar 11, 2024
Via arXiv

On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks

Jan 26, 2024
Via arXiv

On the Expressivity of Objective-Specification Formalisms in Reinforcement Learning

Oct 18, 2023
Via arXiv

Goodhart's Law in Reinforcement Learning

Oct 13, 2023
Via arXiv

STARC: A General Framework For Quantifying Differences Between Reward Functions

Sep 26, 2023
Via arXiv

Lexicographic Multi-Objective Reinforcement Learning

Dec 28, 2022
Via arXiv

Misspecification in Inverse Reinforcement Learning

Dec 06, 2022
Via arXiv

Defining and Characterizing Reward Hacking

Sep 27, 2022
Via arXiv