Francis Rhys Ward

Evaluating Language Model Character Traits

Oct 05, 2024
Via arXiv

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 12, 2024
Via arXiv

The Reasons that Agents Act: Intention and Instrumental Goals

Feb 15, 2024
Via arXiv

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Dec 03, 2023
Via arXiv

Experiments with Detecting and Mitigating AI Deception

Jun 26, 2023
Via arXiv

Argumentative Reward Learning: Reasoning About Human Preferences

Sep 28, 2022
Via arXiv