Francis Rhys Ward

The Elicitation Game: Evaluating Capability Elicitation Techniques

Feb 04, 2025
Via arXiv

Evaluating Language Model Character Traits

Oct 05, 2024
Via arXiv

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 12, 2024
Via arXiv

The Reasons that Agents Act: Intention and Instrumental Goals

Feb 15, 2024
Via arXiv

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Dec 03, 2023
Via arXiv

Experiments with Detecting and Mitigating AI Deception

Jun 26, 2023
Via arXiv

Argumentative Reward Learning: Reasoning About Human Preferences

Sep 28, 2022
Via arXiv