Title: Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models

Oct 23, 2023

Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye


Abstract: With large language models (LLMs) poised to become embedded in our daily lives, questions are starting to be raised about the datasets they learned from. These questions range from potential bias or misinformation LLMs could retain from their training data to questions of copyright and fair use of human-generated text. However, while these questions emerge, developers of recent state-of-the-art LLMs are becoming increasingly reluctant to disclose details of their training corpora. Here we introduce the task of document-level membership inference for real-world LLMs, i.e., inferring whether the LLM has seen a given document during training. First, we propose a procedure for developing and evaluating document-level membership inference for LLMs by leveraging commonly used training data sources and the model release date. We then propose a practical, black-box method to predict document-level membership and instantiate it on OpenLLaMA-7B with both books and academic papers. Our methodology reaches an AUC of 0.856 for books and 0.678 for papers. We then show that our approach outperforms the sentence-level membership inference attacks used in the privacy literature on the document-level membership task. Finally, we evaluate whether smaller models might be less sensitive to document-level inference and show OpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach. Taken together, our results show that accurate document-level membership can be inferred for LLMs, increasing the transparency of technology poised to change our lives.
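For readers unfamiliar with the AUC figures quoted in the abstract, the sketch below shows how an AUC can be computed for any membership classifier: it is the probability that a randomly chosen member document receives a higher score than a randomly chosen non-member. This is not the authors' method or code. The function name `roc_auc` and the toy labels and scores are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (member, non-member) pairs ranked correctly.

    labels: 1 for a document assumed to be in the training set, 0 otherwise.
    scores: membership scores from some classifier (higher = more likely member).
    Ties between a member and a non-member count as half a correct pair.
    """
    members = [s for y, s in zip(labels, scores) if y == 1]
    non_members = [s for y, s in zip(labels, scores) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")

    correct = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                correct += 1.0
            elif m == n:
                correct += 0.5
    return correct / (len(members) * len(non_members))


# Hypothetical toy data, not results from the paper.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.856 (books) and 0.678 (papers) should be read.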
