Chhavi Yadav

Can We Infer Confidential Properties of Training Data from LLMs?

Jun 12, 2025

Penguin Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri

Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties, such as patient demographics or disease prevalence, that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
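
The abstract does not spell out how the generation attack works, so the following is only a rough sketch of the general idea of estimating a dataset-level property from sampled model outputs. It assumes an attacker has already collected generations from a fine-tuned model. The sample strings, the keyword-based labeler `mentions_female_patient`, and the helper `estimate_property_ratio` are all hypothetical and are not taken from PropInfer.

```python
def estimate_property_ratio(samples, has_property):
    """Fraction of generated samples for which the property holds."""
    if not samples:
        raise ValueError("need at least one generated sample")
    hits = sum(1 for s in samples if has_property(s))
    return hits / len(samples)


def mentions_female_patient(text):
    """Hypothetical keyword labeler; a real attack would need a better one."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    return bool(words & {"she", "her", "woman", "female"})


# Stand-in strings; in practice these would be sampled from the target model.
generated = [
    "She reports chest pain after exercise.",
    "He has had a fever for three days.",
    "The patient says her knee is swollen.",
    "My son has a persistent cough.",
]

ratio = estimate_property_ratio(generated, mentions_female_patient)
print(f"estimated ratio: {ratio:.2f}")
```

The estimate is only as good as the labeler and the number of samples, which is one reason a shadow-model attack can be a useful complement.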

ExpProof : Operationalizing Explanations for Confidential Models with ZKPs

Feb 06, 2025

Chhavi Yadav, Evan Monroe Laufer, Dan Boneh, Kamalika Chaudhuri

Abstract: In principle, explanations are intended as a way to increase trust in machine learning models and are often mandated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand (Bordt et al., 2022). In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically, we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests.

Evaluating Deep Unlearning in Large Language Models

Oct 19, 2024

Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri

Abstract: Machine unlearning is a key requirement of many data protection regulations such as GDPR. Prior work on unlearning has mostly considered superficial unlearning tasks where a single or a few related pieces of information are required to be removed. However, the task of unlearning a fact is much more challenging in recent large language models (LLMs), because the facts in LLMs can be deduced from each other. In this work, we investigate whether current unlearning methods for LLMs succeed beyond superficial unlearning of facts. Specifically, we formally propose a framework and a definition for deep unlearning of facts that are interrelated. We design a metric, recall, to quantify the extent of deep unlearning. To systematically evaluate deep unlearning, we construct a synthetic dataset, EDU-RELAT, which consists of a synthetic knowledge base of family relationships and biographies, together with a realistic logical rule set that connects them. We use this dataset to test four unlearning methods on four LLMs of different sizes. Our findings reveal that, when deep unlearning only a single fact, these methods either fail to properly unlearn with high recall or end up unlearning many other irrelevant facts. Our dataset and code are publicly available at: https://github.com/wrh14/deep_unlearning.
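
To illustrate why removing a single fact can be insufficient when facts are deducible from each other, here is a minimal sketch using forward chaining over a toy family knowledge base. The facts, the single rule, and the helper names are hypothetical. This is neither EDU-RELAT nor the paper's recall metric, just the underlying intuition.

```python
from itertools import product

# Hypothetical toy knowledge base of (relation, subject, object) triples.
FACTS = {
    ("mother", "Ann", "Bob"),    # Ann is Bob's mother
    ("husband", "Carl", "Ann"),  # Carl is Ann's husband
    ("father", "Carl", "Bob"),   # Carl is Bob's father
}


def apply_rules(facts):
    """One round of forward chaining with a single illustrative rule:
    husband(X, Y) and mother(Y, Z) => father(X, Z)."""
    derived = set()
    for (r1, x, y), (r2, y2, z) in product(facts, repeat=2):
        if r1 == "husband" and r2 == "mother" and y == y2:
            derived.add(("father", x, z))
    return derived


def deductive_closure(facts):
    """Repeat rule application until no new fact appears."""
    closure = set(facts)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


def is_deeply_unlearned(target, remaining):
    """The target must not be derivable from what is still known."""
    return target not in deductive_closure(remaining)


target = ("father", "Carl", "Bob")
superficial = FACTS - {target}                      # delete only the target
deep = superficial - {("husband", "Carl", "Ann")}   # also break the derivation

print(is_deeply_unlearned(target, superficial))
print(is_deeply_unlearned(target, deep))
```

In this toy setting, deleting only the target fact leaves it re-derivable, while additionally removing one supporting fact blocks the derivation.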

Influence-based Attributions can be Manipulated

Sep 10, 2024

Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri

Abstract: Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions under adversarial circumstances.

FairProof : Confidential and Certifiable Fairness for Neural Networks

Feb 19, 2024

Chhavi Yadav, Amrita Roy Chowdhury, Dan Boneh, Kamalika Chaudhuri

Abstract: Machine learning models are increasingly used in societal applications, yet legal and privacy concerns demand that they very often be kept confidential. Consequently, there is growing distrust about the fairness properties of these models in the minds of consumers, who are often at the receiving end of model predictions. To this end, we propose FairProof, a system that uses Zero-Knowledge Proofs (a cryptographic primitive) to publicly verify the fairness of a model while maintaining confidentiality. We also propose a fairness certification algorithm for fully-connected neural networks that is well suited to ZKPs and is used in this system. We implement FairProof in Gnark and demonstrate empirically that our system is practically feasible.

Keeping Up with the Language Models: Robustness-Bias Interplay in NLI Data and Models

May 22, 2023

Ioana Baldini, Chhavi Yadav, Payel Das, Kush R. Varshney

Abstract: Auditing unwanted social bias in language models (LMs) is inherently hard due to the multidisciplinary nature of the work. In addition, the rapid evolution of LMs can make benchmarks irrelevant in no time. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset (BBNLI-next) is more challenging than BBNLI: on average, BBNLI-next reduces the accuracy of state-of-the-art NLI models from 95.3%, as observed by BBNLI, to 58.6%. Second, we employ BBNLI-next to showcase the interplay between robustness and bias, and the subtlety in differentiating between the two. Third, we point out shortcomings in current bias scores used in the literature and propose bias measures that take into account pro-/anti-stereotype bias and model brittleness. We will publicly release the BBNLI-next dataset to inspire research on rapidly expanding benchmarks to keep up with model evolution, along with research on the robustness-bias interplay in bias auditing. Note: This paper contains offensive text examples.

A Learning-Theoretic Framework for Certified Auditing of Machine Learning Models

Jun 09, 2022

Chhavi Yadav, Michal Moshkovitz, Kamalika Chaudhuri

Abstract: Responsible use of machine learning requires that models be audited for undesirable properties. However, how to do principled auditing in a general setting has remained poorly understood. In this paper, we propose a formal learning-theoretic framework for auditing. We propose algorithms for auditing linear classifiers for feature sensitivity using label queries as well as different kinds of explanations, and provide performance guarantees. Our results illustrate that while counterfactual explanations can be extremely helpful for auditing, anchor explanations may not be as beneficial in the worst case.

Behavior of k-NN as an Instance-Based Explanation Method

Sep 14, 2021

Chhavi Yadav, Kamalika Chaudhuri

Abstract: Adoption of DL models in critical areas has led to an escalating demand for sound explanation methods. Instance-based explanation methods are a popular type that return selected instances from the training set to explain the predictions for a test sample. One way to connect these explanations with prediction is to ask the following counterfactual question - how do the loss and prediction for a test sample change when explanations are removed from the training set? Our paper answers this question for k-NNs, which are natural contenders for an instance-based explanation method. We first demonstrate empirically that the representation space induced by the last layer of a neural network is the best to perform k-NN in. Using this layer, we conduct our experiments and compare them to influence functions (IFs) (Koh and Liang, 2017), which try to answer a similar question. Our evaluations do indicate a change in loss and predictions when explanations are removed, but we do not find a trend between $k$ and loss or prediction change. We find significant stability in the predictions and loss of MNIST vs. CIFAR-10. Surprisingly, we do not observe much difference in the behavior of k-NNs vs. IFs on this question. We attribute this to training set subsampling for IFs.
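
The counterfactual question above can be sketched on toy data as follows. This is a minimal illustration, not the paper's experimental setup: the 2D points stand in for last-layer representations, and the data, labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def knn_indices(train_x, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))
    return [i for _, i in dists[:k]]


def knn_predict(train_x, train_y, query, k):
    """Majority-vote label plus the neighbors that act as the explanation."""
    idx = knn_indices(train_x, query, k)
    votes = Counter(train_y[i] for i in idx)
    return votes.most_common(1)[0][0], idx


def predict_without(train_x, train_y, query, k, removed):
    """Re-predict after deleting the explanation instances from the training set."""
    removed = set(removed)
    keep = [i for i in range(len(train_x)) if i not in removed]
    kept_x = [train_x[i] for i in keep]
    kept_y = [train_y[i] for i in keep]
    label, _ = knn_predict(kept_x, kept_y, query, k)
    return label


# Stand-in "representations" and labels (assumed, for illustration only).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
query = (1, 1)

label, explanation = knn_predict(train_x, train_y, query, k=3)
new_label = predict_without(train_x, train_y, query, 3, explanation)
print(label, sorted(explanation), new_label)
```

On this toy set the prediction flips once the three explaining neighbors are removed; the abstract reports that on real data the change does not follow a clear trend in $k$.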

On the design of convolutional neural networks for automatic detection of Alzheimer's disease

Nov 12, 2019

Sheng Liu, Chhavi Yadav, Carlos Fernandez-Granda, Narges Razavian

Abstract: Early detection is a crucial goal in the study of Alzheimer's Disease (AD). In this work, we describe several techniques to boost the performance of 3D convolutional neural networks trained to detect AD using structural brain MRI scans. Specifically, we provide evidence that (1) instance normalization outperforms batch normalization, (2) early spatial downsampling negatively affects performance, (3) widening the model brings consistent gains while increasing the depth does not, and (4) incorporating age information yields moderate improvement. Together, these insights yield an increment of approximately 14% in test accuracy over existing models when distinguishing between patients with AD, mild cognitive impairment, and controls in the ADNI dataset. Similar performance is achieved on an independent dataset.

* Proceedings of Machine Learning Research, 2019
* Machine Learning for Health Workshop, NeurIPS 2019. Authors Fernandez-Granda and Razavian are joint last authors

Cold Case: The Lost MNIST Digits

May 25, 2019

Chhavi Yadav, Léon Bottou

Abstract: Although the popular MNIST dataset [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
