Title: Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024

Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.
