Leo Richter

An Auditing Test To Detect Behavioral Shift in Language Models

Oct 25, 2024

Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner

Abstract: As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
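
The idea of a tolerance-aware test with false-positive control can be illustrated with a minimal sketch. This is not the paper's auditing test, which is sequential and compares behavior distributions. It is a simpler fixed-sample stand-in that compares mean behavior scores using Hoeffding's inequality. The function names, the assumption that each generation has already been scored in [0, 1] (for example by a toxicity classifier), and the default values of `epsilon` and `alpha` are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def hoeffding_margin(n: int, alpha: float) -> float:
    """Two-sided Hoeffding margin for the mean of n independent values in [-1, 1]."""
    return math.sqrt(2.0 * math.log(2.0 / alpha) / n)


def detect_shift(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    epsilon: float = 0.1,   # hypothetical tolerance: mean shifts up to this size are acceptable
    alpha: float = 0.05,    # hypothetical bound on the false-positive rate
) -> bool:
    """Flag a shift when the mean score difference exceeds epsilon plus the margin.

    Assumes scores lie in [0, 1] and that the i-th entries of both sequences
    score generations for the same prompt, with prompts drawn independently.
    """
    n = len(baseline_scores)
    if n == 0 or n != len(candidate_scores):
        raise ValueError("score sequences must be non-empty and of equal length")
    # Paired differences are bounded in [-1, 1], so Hoeffding's inequality applies.
    mean_diff = sum(c - b for b, c in zip(baseline_scores, candidate_scores)) / n
    # If the true mean difference is at most epsilon in absolute value,
    # this rejects with probability at most alpha.
    return abs(mean_diff) > epsilon + hoeffding_margin(n, alpha)
```

With 500 paired scores and `alpha = 0.05`, the margin is roughly 0.12, so under these assumptions a mean shift must exceed about `epsilon + 0.12` before it is flagged. A larger `epsilon` makes the check less sensitive, mirroring the role of the tolerance parameter described in the abstract.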

* 25 pages, 12 figures
