Wassim Bouaziz

Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

Jun 17, 2025

Wassim Bouaziz, Mathurin Videau, Nicolas Usunier, El-Mahdi El-Mhamdi

Abstract: The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of training data, which LM providers try to limit. In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from training data) is not only feasible but also makes it possible to effectively protect a dataset and trace its use. Using gradient-based optimization prompt-tuning, we make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus. We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make an LM learn a secret and detect it with extremely high confidence ($p < 10^{-55}$) with a theoretically certifiable scheme. Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets never appearing in the training set.

* 18 pages, 12 figures

Targeted Data Poisoning for Black-Box Audio Datasets Ownership Verification

Mar 13, 2025

Wassim Bouaziz, El-Mahdi El-Mhamdi, Nicolas Usunier

Abstract: Protecting the use of audio datasets is a major concern for data owners, particularly with the recent rise of audio deep learning models. While watermarks can be used to protect the data itself, they do not make it possible to identify a deep learning model trained on a protected dataset. In this paper, we adapt the recently introduced data taggants approach to audio data. Data taggants is a method to verify whether a neural network was trained on a protected image dataset with only top-$k$ prediction access to the model. This method relies on a targeted data poisoning scheme that discreetly alters a small fraction (1%) of the dataset so as to induce a harmless behavior on out-of-distribution data called keys. We evaluate our method on the Speechcommands and the ESC50 datasets and on state-of-the-art transformer models, and show that we can detect the use of the dataset with high confidence without loss of performance. We also show the robustness of our method against common data augmentation techniques, making it a practical method to protect audio datasets.

* Published at ICASSP 2025, 5 pages, 7 figures

Easing Optimization Paths: a Circuit Perspective

Jan 04, 2025

Ambroise Odonnat, Wassim Bouaziz, Vivien Cabannes

Abstract: Gradient descent is the method of choice for training large artificial intelligence systems. As these systems become larger, a better understanding of the mechanisms behind gradient training would allow us to alleviate compute costs and help steer these systems away from harmful behaviors. To that end, we suggest utilizing the circuit perspective brought forward by mechanistic interpretability. After laying out our intuition, we illustrate how it enables us to design a curriculum for efficient learning in a controlled setting. The code is available at https://github.com/facebookresearch/pal.

* Accepted at ICASSP 2025

A Visual Case Study of the Training Dynamics in Neural Networks

Oct 31, 2024

Ambroise Odonnat, Wassim Bouaziz, Vivien Cabannes

Abstract: This paper introduces a visual sandbox designed to explore the training dynamics of a small-scale transformer model, with the embedding dimension constrained to $d=2$. This restriction allows for a comprehensive two-dimensional visualization of each layer's dynamics. Through this approach, we gain insights into training dynamics, circuit transferability, and the causes of loss spikes, including those induced by the high curvature of normalization layers. We propose strategies to mitigate these spikes, demonstrating how good visualization facilitates the design of innovative ideas of practical interest. Additionally, we believe our sandbox could assist theoreticians in assessing essential training dynamics mechanisms and integrating them into future theories. The code is available at https://github.com/facebookresearch/pal.

Inverting Gradient Attacks Naturally Makes Data Poisons: An Availability Attack on Neural Networks

Oct 28, 2024

Wassim Bouaziz, El-Mahdi El-Mhamdi, Nicolas Usunier

Abstract: Gradient attacks and data poisoning tamper with the training of machine learning algorithms to maliciously alter them and have been proven to be equivalent in convex settings. The extent of harm these attacks can produce in non-convex settings is still to be determined. Gradient attacks can affect far fewer systems than data poisoning but have been argued to be more harmful since they can be arbitrary, whereas data poisoning reduces the attacker's power to only being able to inject data points into training sets, e.g., via legitimate participation in a collaborative dataset. This raises the question of whether the harm done by gradient attacks can be matched by data poisoning in non-convex settings. In this work, we provide a positive answer in a worst-case scenario and show how data poisoning can mimic a gradient attack to perform an availability attack on (non-convex) neural networks. Through gradient inversion, commonly used to reconstruct data points from actual gradients, we show how reconstructing data points out of malicious gradients can be sufficient to perform a range of attacks. This allows us to show, for the first time, an availability attack on neural networks through data poisoning that degrades the model's performance to random level with a minority (as low as 1%) of poisoned points.

* 8 pages, 10 figures
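
The idea of turning a malicious gradient into a data point can be illustrated in the simplest convex case. The sketch below is a toy example, not the paper's method: it assumes a linear model with squared loss, where the inversion has a closed form, and it ignores any constraint on what counts as a valid data point. The function names `loss_gradient` and `invert_gradient` are hypothetical.

```python
def loss_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (linear model, assumed setting)."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def invert_gradient(w, target_grad):
    """Return a data point (x, y) whose loss gradient at w equals target_grad.

    Choosing x = target_grad and y = w.x - 1 forces the residual to 1,
    so the gradient residual * x is exactly target_grad.
    """
    x = list(target_grad)
    y = sum(wi * xi for wi, xi in zip(w, x)) - 1.0
    return x, y


w = [0.5, -1.0, 2.0]               # current model parameters (illustrative values)
malicious_grad = [3.0, 0.0, -4.0]  # gradient the attacker would like to inject
x, y = invert_gradient(w, malicious_grad)
recovered = loss_gradient(w, x, y)
print(x, y, recovered)
```

In non-convex networks no such closed form is available in general, which is why the abstract frames reconstruction from malicious gradients as a gradient inversion problem.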

Data Taggants: Dataset Ownership Verification via Harmless Targeted Data Poisoning

Oct 09, 2024

Wassim Bouaziz, El-Mahdi El-Mhamdi, Nicolas Usunier

Abstract: Dataset ownership verification, the process of determining if a dataset is used in a model's training data, is necessary for detecting unauthorized data usage and data contamination. Existing approaches, such as backdoor watermarking, rely on inducing a detectable behavior into the trained model on a part of the data distribution. However, these approaches have limitations, as they can be harmful to the model's performance or require impractical access to the model's internals. Most importantly, previous approaches lack guarantees against false positives. This paper introduces data taggants, a novel non-backdoor dataset ownership verification technique. Our method uses pairs of out-of-distribution samples and random labels as secret keys, and leverages clean-label targeted data poisoning to subtly alter a dataset, so that models trained on it respond to the key samples with the corresponding key labels. The keys are built so as to allow for statistical certificates with only black-box access to the model. We validate our approach through comprehensive and realistic experiments on ImageNet1k using ViT and ResNet models with state-of-the-art training recipes. Our findings demonstrate that data taggants can reliably make models trained on the protected dataset detectable with high confidence, without compromising validation accuracy, and that they outperform backdoor watermarking. Moreover, our method proves to be stealthy and robust against various defense mechanisms.

* 16 pages, 7 figures
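
A statistical certificate of this kind can be sketched as a one-sided binomial test. The example below is an illustration under stated assumptions, not the paper's exact procedure: it assumes that, for a model not trained on the protected dataset, each uniformly random key label falls in the model's top-$k$ predictions independently with probability $k/C$, where $C$ is the number of classes. The function name `key_match_p_value` and the numbers used are hypothetical.

```python
from math import comb


def key_match_p_value(n_keys: int, n_matches: int, top_k: int, n_classes: int) -> float:
    """Upper-tail binomial p-value P(X >= n_matches), X ~ Binomial(n_keys, top_k / n_classes).

    Null hypothesis (assumed): the model never saw the protected dataset, so each
    random key label lands in its top-k predictions with probability top_k / n_classes.
    """
    p = top_k / n_classes
    return sum(
        comb(n_keys, i) * p**i * (1 - p) ** (n_keys - i)
        for i in range(n_matches, n_keys + 1)
    )


# Illustrative numbers only: 20 keys, 8 of which match within the top-10 of 1000 classes.
p_value = key_match_p_value(n_keys=20, n_matches=8, top_k=10, n_classes=1000)
print(f"p-value under the null: {p_value:.3e}")
```

A small p-value means the observed number of key matches is very unlikely for a model that never saw the dataset, and the test needs only top-$k$ outputs rather than access to the model's internals.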

Iteration Head: A Mechanistic Study of Chain-of-Thought

Jun 04, 2024

Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, Julia Kempe

Abstract: Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings of CoT capabilities and of the conditions under which they emerge remains limited. This paper helps fill this gap by demonstrating how CoT reasoning emerges in transformers in a controlled and interpretable setting. In particular, we observe the appearance of a specialized attention mechanism dedicated to iterative reasoning, which we call "iteration heads". We track both the emergence and the precise working of these iteration heads down to the attention level, and measure the transferability of the CoT skills to which they give rise between tasks.
