Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Harsh Chaudhari

Cascading Adversarial Bias from Injection to Distillation in Language Models

May 30, 2025

Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea

Abstract:Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.

Via

Access Paper or Ask Questions

R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model

May 19, 2025

Ali Naseh, Harsh Chaudhari, Jaechul Roh, Mingshi Wu, Alina Oprea, Amir Houmansadr

Abstract:DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI's o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.

Via

Access Paper or Ask Questions

Measuring memorization through probabilistic discoverable extraction

Oct 25, 2024

Jamie Hayes, Marika Swanberg, Harsh Chaudhari, Itay Yona, Ilia Shumailov

Figure 1 for Measuring memorization through probabilistic discoverable extraction

Figure 2 for Measuring memorization through probabilistic discoverable extraction

Figure 3 for Measuring memorization through probabilistic discoverable extraction

Figure 4 for Measuring memorization through probabilistic discoverable extraction

Abstract:Large language models (LLMs) are susceptible to memorizing training data, raising concerns due to the potential extraction of sensitive information. Current methods to measure memorization rates of LLMs, primarily discoverable extraction (Carlini et al., 2022), rely on single-sequence greedy sampling, potentially underestimating the true extent of memorization. This paper introduces a probabilistic relaxation of discoverable extraction that quantifies the probability of extracting a target sequence within a set of generated samples, considering various sampling schemes and multiple attempts. This approach addresses the limitations of reporting memorization rates through discoverable extraction by accounting for the probabilistic nature of LLMs and user interaction patterns. Our experiments demonstrate that this probabilistic measure can reveal cases of higher memorization rates compared to rates found through discoverable extraction. We further investigate the impact of different sampling schemes on extractability, providing a more comprehensive and realistic assessment of LLM memorization and its associated risks. Our contributions include a new probabilistic memorization definition, empirical evidence of its effectiveness, and a thorough evaluation across different models, sizes, sampling schemes, and training data repetitions.

Via

Access Paper or Ask Questions

Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

May 30, 2024

Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea

Abstract:Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim's RAG system, by injecting a single malicious document in its knowledge database. We design Phantom, general two-step attack framework against RAG augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as backdoor, is present in the victim's queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.

Via

Access Paper or Ask Questions

L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Dec 30, 2023

Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi

Figure 1 for L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Figure 2 for L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Figure 3 for L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Figure 4 for L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Abstract:This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP

* Accepted at Forum for Information Retrieval Evaluation (FIRE 2023)

Via

Access Paper or Ask Questions

On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi

Dec 03, 2023

Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi, Sachin Pande

Abstract:Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low resource languages. In this work, we focus on NER for low-resource language and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN, and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using sub-word tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.

* Accepted at ICDAM 2023

Via

Access Paper or Ask Questions

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Oct 05, 2023

Harsh Chaudhari, Giorgio Severi, Alina Oprea, Jonathan Ullman

Figure 1 for Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Figure 2 for Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Figure 3 for Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Figure 4 for Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Abstract:The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.

Via

Access Paper or Ask Questions

SNAP: Efficient Extraction of Private Properties with Poisoning

Aug 25, 2022

Harsh Chaudhari, John Abascal, Alina Oprea, Matthew Jagielski, Florian Tramèr, Jonathan Ullman

Figure 1 for SNAP: Efficient Extraction of Private Properties with Poisoning

Figure 2 for SNAP: Efficient Extraction of Private Properties with Poisoning

Figure 3 for SNAP: Efficient Extraction of Private Properties with Poisoning

Figure 4 for SNAP: Efficient Extraction of Private Properties with Poisoning

Abstract:Property inference attacks allow an adversary to extract global properties of the training dataset from a machine learning model. Such attacks have privacy implications for data owners who share their datasets to train machine learning models. Several existing approaches for property inference attacks against deep neural networks have been proposed, but they all rely on the attacker training a large number of shadow models, which induces large computational overhead. In this paper, we consider the setting of property inference attacks in which the attacker can poison a subset of the training dataset and query the trained target model. Motivated by our theoretical analysis of model confidences under poisoning, we design an efficient property inference attack, SNAP, which obtains higher attack success and requires lower amounts of poisoning than the state-of-the-art poisoning-based property inference attack by Mahloujifar et al. For example, on the Census dataset, SNAP achieves 34% higher success rate than Mahloujifar et al. while being 56.5x faster. We also extend our attack to determine if a certain property is present at all in training, and estimate the exact proportion of a property of interest efficiently. We evaluate our attack on several properties of varying proportions from four datasets, and demonstrate SNAP's generality and effectiveness.

* 27 pages, 13 figures

Via

Access Paper or Ask Questions

SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning

May 20, 2022

Harsh Chaudhari, Matthew Jagielski, Alina Oprea

Figure 1 for SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning

Figure 2 for SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning

Figure 3 for SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning

Figure 4 for SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning

Abstract:Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, the datasets used for training ML models might be under the control of an adversary mounting a data poisoning attack, and MPC prevents inspecting training sets to detect poisoning. We show that multiple MPC frameworks for private ML training are susceptible to backdoor and targeted poisoning attacks. To mitigate this, we propose SafeNet, a framework for building ensemble models in MPC with formal guarantees of robustness to data poisoning attacks. We extend the security definition of private ML training to account for poisoning and prove that our SafeNet design satisfies the definition. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models. For instance, SafeNet reduces backdoor attack success from 100% to 0% for a neural network model, while achieving 39x faster training and 36x less communication than the four-party MPC framework of Dalskov et al.

Via

Access Paper or Ask Questions

ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Dec 05, 2019

Harsh Chaudhari, Ashish Choudhury, Arpita Patra, Ajith Suresh

Figure 1 for ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Figure 2 for ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Figure 3 for ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Figure 4 for ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Abstract:The concrete efficiency of secure computation has been the focus of many recent works. In this work, we present concretely-efficient protocols for secure $3$-party computation (3PC) over a ring of integers modulo $2^{\ell}$ tolerating one corruption, both with semi-honest and malicious security. Owing to the fact that computation over ring emulates computation over the real-world system architectures, secure computation over ring has gained momentum of late. Cast in the offline-online paradigm, our constructions present the most efficient online phase in concrete terms. In the semi-honest setting, our protocol requires communication of $2$ ring elements per multiplication gate during the {\it online} phase, attaining a per-party cost of {\em less than one element}. This is achieved for the first time in the regime of 3PC. In the {\it malicious} setting, our protocol requires communication of $4$ elements per multiplication gate during the online phase, beating the state-of-the-art protocol by $5$ elements. Realized with both the security notions of selective abort and fairness, the malicious protocol with fairness involves slightly more communication than its counterpart with abort security for the output gates {\em alone}. We apply our techniques from $3$PC in the regime of secure server-aided machine-learning (ML) inference for a range of prediction functions-- linear regression, linear SVM regression, logistic regression, and linear SVM classification. Our setting considers a model-owner with trained model parameters and a client with a query, with the latter willing to learn the prediction of her query based on the model parameters of the former. The inputs and computation are outsourced to a set of three non-colluding servers. Our constructions catering to both semi-honest and the malicious world, invariably perform better than the existing constructions.

* This article is the full and extended version of an article appeared in ACM CCSW 2019

Via

Access Paper or Ask Questions