Teun van der Weij

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Dec 02, 2024
Via arXiv

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 12, 2024
Via arXiv

Extending Activation Steering to Broad Skills and Multiple Behaviours

Mar 09, 2024
Via arXiv

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023
Via arXiv