Abstract: Understanding whether and to what extent large language models (LLMs) have memorised training data has important implications for the reliability of their output and the privacy of their training data. In order to cleanly measure and disentangle memorisation from other phenomena (e.g., in-context learning), we create an experimental framework based on repeatedly exposing LLMs to random strings, which allows us to characterise the dynamics of the model's behaviour under such repeated exposure. Using our framework, we make several striking observations: (a) we find consistent phases of the dynamics across families of models (Pythia, Phi and Llama2), (b) we identify factors that make some strings easier to memorise than others, and (c) we identify the role of local prefixes and global context in memorisation. We also show that sequential exposure to different random strings has a significant effect on memorisation. Our results, often surprising, have significant downstream implications for the study and usage of LLMs.
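To make the setup concrete, here is a minimal sketch of the kind of experiment such a framework runs, assuming HuggingFace transformers and the smallest Pythia model; the string length, exposure schedule, and memorisation metric below are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: repeatedly expose a small LLM to one random token string and
# track how much of it can be greedily reproduced from a short prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # smallest Pythia model, for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A "random string": a uniformly sampled token sequence (assumed length 64).
rand_ids = torch.randint(0, tok.vocab_size, (1, 64))

def memorised_fraction(model, ids, prefix_len=8):
    # Greedily continue from a short prefix and count exact-token matches.
    with torch.no_grad():
        out = model.generate(ids[:, :prefix_len],
                             max_new_tokens=ids.shape[1] - prefix_len,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    gen = out[:, prefix_len:]
    n = min(gen.shape[1], ids.shape[1] - prefix_len)  # generation may stop early at EOS
    return (gen[:, :n] == ids[:, prefix_len:prefix_len + n]).float().mean().item()

for step in range(100):  # repeated exposure to the same random string
    loss = model(input_ids=rand_ids, labels=rand_ids).loss
    loss.backward(); opt.step(); opt.zero_grad()
    if step % 10 == 0:
        print(step, round(loss.item(), 3), memorised_fraction(model, rand_ids))
```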
Abstract: We propose an approach for estimating the latent knowledge embedded inside large language models (LLMs). We leverage the in-context learning (ICL) abilities of LLMs to estimate the extent to which an LLM knows the facts stored in a knowledge base. Our knowledge estimator avoids the reliability concerns of previous prompting-based methods, is both conceptually simpler and easier to apply, and, as we demonstrate, can surface more of the latent knowledge embedded in LLMs. We also investigate how different design choices affect the performance of ICL-based knowledge estimation. Using the proposed estimator, we perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs, including OPT, Pythia, Llama(2), Mistral and Gemma, over a large set of relations and facts from the Wikidata knowledge base. We observe differences in factual knowledge across model families and model sizes; we find that some relations are consistently better known than others, yet models differ in the precise facts they know; and we find differences between the knowledge of base models and their finetuned counterparts.
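As an illustration of the idea, the following hedged sketch probes whether a model "knows" a single fact by prompting it with in-context demonstrations of the relation; the prompt template, relation, facts, and scoring rule are hypothetical stand-ins, not the paper's exact estimator.

```python
# Sketch of ICL-based fact probing: show a few (subject, object) demonstrations
# for a relation, then check the model's greedy completion on a held-out subject.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/pythia-410m")

# Hypothetical in-context demonstrations for the relation "capital of".
demos = [("France", "Paris"), ("Japan", "Tokyo"), ("Canada", "Ottawa")]
query_subject, gold_object = "Kenya", "Nairobi"

prompt = "".join(f"The capital of {s} is {o}.\n" for s, o in demos)
prompt += f"The capital of {query_subject} is"

out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
completion = out[len(prompt):].strip()
print("known" if completion.startswith(gold_object) else "unknown", "->", completion)
```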
Abstract: Trustworthy AI is crucial to the widespread adoption of AI in high-stakes applications, with fairness, robustness, and accuracy being some of the key trustworthiness metrics. In this work, we propose a controllable framework for data-centric trustworthy AI (DCTAI), VTruST, that allows users to control the trade-offs between the different trustworthiness metrics of the constructed training datasets. A key challenge in implementing an efficient DCTAI framework is to design an online value-function-based algorithm for training data subset selection. We formulate training data valuation and subset selection as an online sparse approximation problem and propose a novel online version of the Orthogonal Matching Pursuit (OMP) algorithm for solving it. Experimental results show that VTruST outperforms the state-of-the-art baselines on social, image, and scientific datasets. We also show that the data values generated by VTruST can provide effective data-centric explanations for different trustworthiness metrics.
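For reference, the sketch below implements classical (batch) Orthogonal Matching Pursuit, the building block that an online variant would adapt; the online algorithm and its value-function targets differ in detail from this vanilla version.

```python
# Classical OMP: greedily select k columns of A that sparsely approximate y.
import numpy as np

def omp(A, y, k):
    residual, selected = y.copy(), []
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        selected.append(j)
        # Re-fit coefficients on all selected columns (least squares).
        coef, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
        residual = y - A[:, selected] @ coef
    return selected, coef

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))                  # columns ~ candidate training points
y = A[:, [3, 17, 42]] @ np.array([1.0, -2.0, 0.5])  # target value vector
print(omp(A, y, k=3)[0])                            # recovers {3, 17, 42}
```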
Abstract: Targeted model poisoning attacks pose a significant threat to federated learning systems. Recent studies show that edge-case targeted attacks, which target a small fraction of the input space, are nearly impossible to counter using existing fixed defense strategies. In this paper, we design a learned defense strategy against such attacks using a small defense dataset. The defense dataset can be collected by the central authority of the federated learning task and should contain a mix of poisoned and clean examples; these examples need not be pre-marked as poisoned or clean. The proposed framework, LearnDefend, estimates the probability of a client update being malicious. We also learn a poisoned data detector model that can be used to mark each example in the defense dataset as clean or poisoned, and we estimate the poisoned data detector and the client importance models in a coupled optimization approach. Our experiments demonstrate that LearnDefend is capable of defending against state-of-the-art attacks where existing fixed defense strategies fail. We also show that LearnDefend is robust to the size of the defense dataset and to noise in the marking of clean examples.
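A heavily simplified sketch of the central mechanism, down-weighting client updates by a learned maliciousness score before aggregation, is given below; the scoring network, its input features, and the coupling with the poisoned data detector are toy stand-ins rather than the actual LearnDefend models.

```python
# Toy sketch: aggregate client updates weighted by a learned maliciousness score.
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

def aggregate(client_updates, client_features):
    # client_updates: list of flattened update tensors; client_features: (n, 10).
    logits = score_net(client_features).squeeze(-1)  # maliciousness logits
    weights = torch.softmax(-logits, dim=0)          # low score -> high weight
    return sum(w * u for w, u in zip(weights, client_updates))

updates = [torch.randn(100) for _ in range(5)]  # hypothetical flattened updates
features = torch.randn(5, 10)                   # hypothetical per-update features
print(aggregate(updates, features).shape)
```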
Abstract: Data valuation and subset selection have emerged as valuable tools for application-specific selection of important training data. However, the efficiency-accuracy tradeoffs of state-of-the-art methods hinder their widespread application to many AI workflows. In this paper, we propose a novel two-phase solution to this problem. Phase 1 selects representative checkpoints from an SGD-like training algorithm, which are used in phase 2 to estimate approximate training data values, e.g., the decrease in validation loss due to each training point. A key contribution of this paper is CheckSel, an Orthogonal Matching Pursuit-inspired online sparse approximation algorithm for checkpoint selection in the online setting, where the features are revealed one at a time. Another key contribution is the study of data valuation in the domain adaptation setting, where a data value estimator obtained using checkpoints from the training trajectory on a source domain dataset is used for data valuation on a target domain dataset. Experimental results on benchmark datasets show that the proposed algorithm outperforms recent baseline methods by up to 30% in terms of test accuracy while incurring a similar computational burden, in both the standalone and domain adaptation settings.
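To illustrate phase 2, the sketch below estimates per-point data values from a few selected checkpoints using a gradient dot-product (TracIn-style) approximation of the decrease in validation loss; this is an assumed form for illustration, not necessarily the paper's exact estimator.

```python
# Sketch: value(i) ~ sum over selected checkpoints t of lr_t * <g_train_i^t, g_val^t>,
# i.e., the estimated decrease in validation loss attributable to training point i.
import torch

def data_values(train_grads_per_ckpt, val_grad_per_ckpt, lrs):
    values = torch.zeros(train_grads_per_ckpt[0].shape[0])
    for g_train, g_val, lr in zip(train_grads_per_ckpt, val_grad_per_ckpt, lrs):
        values += lr * (g_train @ g_val)  # per-point gradient alignment
    return values

# Toy example: 3 selected checkpoints, 100 training points, 50-dim gradients.
ckpt_train_grads = [torch.randn(100, 50) for _ in range(3)]
ckpt_val_grads = [torch.randn(50) for _ in range(3)]
print(data_values(ckpt_train_grads, ckpt_val_grads, lrs=[0.1, 0.05, 0.01]).shape)
```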
Abstract: Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques for calculating the "value" of individual training data points have been proposed for explaining trained models. However, the value of a training data point also depends on the other selected training data points, a notion that is not explicitly captured by existing methods. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is to design a learnable framework for online subset selection that can be learned using mini-batches of training data, thus making our method scalable. This results in a parameterized convex subset selection problem that is amenable to a differentiable convex programming paradigm, allowing us to learn the parameters of the selection model in end-to-end training. Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm consistently finds higher-value subsets of training data than recent state-of-the-art methods, sometimes ~20% higher in value. The subsets are also useful for finding mislabelled training data, and our algorithm's running time is comparable to that of existing valuation functions.
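One plausible way to realize such a differentiable convex selection layer is via cvxpylayers, sketched below under assumed choices of objective, regularizer, and budget; the paper's exact parameterization may differ.

```python
# Sketch: a parameterized convex selection layer. Selection weights w maximize
# learned per-point scores under a budget, and gradients flow back into the
# scoring model end-to-end through the convex program.
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n, budget = 32, 8                      # mini-batch size, subset budget (assumed)
s = cp.Parameter(n)                    # learned scores (output of a selection model)
w = cp.Variable(n)
prob = cp.Problem(cp.Maximize(s @ w - cp.sum_squares(w)),
                  [w >= 0, w <= 1, cp.sum(w) <= budget])
layer = CvxpyLayer(prob, parameters=[s], variables=[w])

scores = torch.randn(n, requires_grad=True)  # stand-in for a scoring network
(w_star,) = layer(scores)                    # differentiable selection weights
loss = -(w_star * torch.randn(n)).sum()      # stand-in downstream loss
loss.backward()                              # gradients reach the scores
print(w_star.detach().round())
```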
Abstract: Training vision-based urban autonomous driving models is a challenging and heavily researched problem. It is a data-intensive task requiring the storage and processing of vast volumes of (possibly redundant) driving video data. In this paper, we study the problem of developing data-efficient autonomous driving systems and, in this context, the problem of multi-criteria online video frame subset selection. We study convex optimization-based solutions and show that they are unable to provide solutions that place high weight on the loss of the selected video frames. We design a novel convex optimization-based multi-criteria online subset selection algorithm that uses a thresholded concave function of the selection variables, and we also propose and study a submodular optimization-based algorithm. Extensive experiments using the driving simulator CARLA show that we can drop 80% of the frames while still completing 100% of the episodes relative to the model trained on all the data, in the most difficult task of taking turns. This reduces training time to less than 30% of that needed for the whole dataset. We also perform detailed experiments on the prediction performance of the various affordances used by the Conditional Affordance Learning (CAL) model and show that our subset selection improves performance on the crucial affordance "Relative Angle" during turns.
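As a concrete instance of the submodular alternative, the sketch below greedily selects frames under a standard facility-location objective over frame features; the actual multi-criteria objective, which also accounts for per-frame losses, is more involved than this stand-in.

```python
# Greedy facility-location selection: pick k frames so every frame is well
# represented (high similarity) by at least one selected frame.
import numpy as np

def greedy_facility_location(features, k):
    sim = features @ features.T           # frame-frame similarity matrix
    covered = np.zeros(len(features))     # best similarity to the selected set
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate to the selected set.
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf         # never reselect a frame
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sim[j])
    return selected

frames = np.random.default_rng(0).standard_normal((200, 64))  # toy frame features
print(greedy_facility_location(frames, k=20))
```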
Abstract: Travel time estimation is a fundamental problem in transportation science with an extensive literature. The study of these techniques has intensified due to the availability of many large public trip datasets. Recently developed deep learning-based models have improved generality and performance by estimating travel times for individual sub-trajectories and aggregating them to predict the travel time of the entire trajectory. However, these techniques ignore road network information. In this work, we propose and study techniques for incorporating road networks, along with historical trip data, into travel time prediction, incorporating both node embeddings and road distance into an existing model. Experiments on large real-world benchmark datasets suggest improved performance, especially when the training data is small. As expected, the proposed method outperforms the baseline when there is a larger difference between the road distance and the Vincenty (geodesic) distance between the start and end points.
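The distance comparison in question can be computed as sketched below, using the haversine formula as a stand-in for the more exact Vincenty distance; the coordinates and road distance are hypothetical examples, and the road distance would in practice come from a shortest path on the road network.

```python
# Compare straight-line (geodesic) distance to road distance; a large gap
# signals detour-heavy trips where road network information helps most.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance; Vincenty's formulae are a more exact alternative.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

straight = haversine_km(12.9716, 77.5946, 12.9352, 77.6245)  # two points in Bengaluru
road = 7.8  # hypothetical shortest-path road distance, in km
print(f"straight-line {straight:.2f} km, road {road:.2f} km, ratio {road / straight:.2f}")
```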