Title: Benchmarks for Detecting Measurement Tampering

Sep 07, 2023

Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

Abstract: When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, given sets of text inputs and measurements aimed at determining if some outcome occurred, as well as a base model able to accurately predict measurements, the goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering. We demonstrate techniques that outperform simple baselines on most datasets, but don't achieve maximum performance. We believe there is significant room for improvement for both techniques and datasets, and we are excited for future work tackling measurement tampering.
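
The detection task described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `Example` fields, the `evaluate` helper, and the use of AUROC as the score are all assumptions made here for concreteness. The sketch keeps only the examples where every measurement indicates the outcome occurred, then asks how well a detector's scores separate the real positives (the outcome did occur) from the fake positives (the measurements were tampered with).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    # Hypothetical record layout, assumed for illustration only.
    text: str                 # text input given to the model
    measurements: List[bool]  # what each measurement says about the outcome
    outcome: bool             # ground truth, hidden from the detector


def all_positive(ex: Example) -> bool:
    """True when every measurement indicates the outcome occurred."""
    return all(ex.measurements)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a real positive outscores a fake positive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one real and one fake positive")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def evaluate(examples: Sequence[Example], detector_scores: Sequence[float]) -> float:
    """Score a detector on the all-measurements-positive subset only."""
    kept = [
        (score, ex.outcome)
        for ex, score in zip(examples, detector_scores)
        if all_positive(ex)
    ]
    return auroc([s for s, _ in kept], [y for _, y in kept])
```

Here a higher detector score is taken to mean "more likely that the outcome really occurred". Examples with any negative measurement are dropped before scoring, since the question posed in the abstract only concerns cases where all measurements agree that the outcome occurred.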

* Edit: extended and improved appendices, fixed references & figures
