Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guillaume Leclerc

OpenAI o1 System Card

Dec 21, 2024

OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry(+253 more)

Abstract:The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

Via

Access Paper or Ask Questions

Rethinking Backdoor Attacks

Jul 19, 2023

Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

Figure 1 for Rethinking Backdoor Attacks

Figure 2 for Rethinking Backdoor Attacks

Figure 3 for Rethinking Backdoor Attacks

Figure 4 for Rethinking Backdoor Attacks

Abstract:In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the backdoor attack problem. Specifically, we show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data--and thus impossible to "detect" in a general sense. Then, guided by this observation, we revisit existing defenses against backdoor attacks and characterize the (often latent) assumptions they make and on which they depend. Finally, we explore an alternative perspective on backdoor attacks: one that assumes these attacks correspond to the strongest feature in the training data. Under this assumption (which we make formal) we develop a new primitive for detecting backdoor attacks. Our primitive naturally gives rise to a detection algorithm that comes with theoretical guarantees and is effective in practice.

* ICML 2023

Via

Access Paper or Ask Questions

FFCV: Accelerating Training by Removing Data Bottlenecks

Jun 21, 2023

Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, Aleksander Madry

Figure 1 for FFCV: Accelerating Training by Removing Data Bottlenecks

Figure 2 for FFCV: Accelerating Training by Removing Data Bottlenecks

Figure 3 for FFCV: Accelerating Training by Removing Data Bottlenecks

Figure 4 for FFCV: Accelerating Training by Removing Data Bottlenecks

Abstract:We present FFCV, a library for easy and fast machine learning model training. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from the training process. In particular, we combine techniques such as an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to (a) make data loading and transfer significantly more efficient, ensuring that GPUs can reach full utilization; and (b) offload as much data processing as possible to the CPU asynchronously, freeing GPU cycles for training. Using FFCV, we train ResNet-18 and ResNet-50 on the ImageNet dataset with competitive tradeoff between accuracy and training time. For example, we are able to train an ImageNet ResNet-50 model to 75\% in only 20 mins on a single machine. We demonstrate FFCV's performance, ease-of-use, extensibility, and ability to adapt to resource constraints through several case studies. Detailed installation instructions, documentation, and Slack support channel are available at https://ffcv.io/ .

Via

Access Paper or Ask Questions

TRAK: Attributing Model Behavior at Scale

Apr 03, 2023

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry

Figure 1 for TRAK: Attributing Model Behavior at Scale

Figure 2 for TRAK: Attributing Model Behavior at Scale

Figure 3 for TRAK: Attributing Model Behavior at Scale

Figure 4 for TRAK: Attributing Model Behavior at Scale

Abstract:The goal of data attribution is to trace model predictions back to training data. Despite a long line of work towards this goal, existing approaches to data attribution tend to force users to choose between computational tractability and efficacy. That is, computationally tractable methods can struggle with accurately attributing model predictions in non-convex settings (e.g., in the context of deep neural networks), while methods that are effective in such regimes require training thousands of models, which makes them impractical for large models or datasets. In this work, we introduce TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models. In particular, by leveraging only a handful of trained models, TRAK can match the performance of attribution methods that require training thousands of models. We demonstrate the utility of TRAK across various modalities and scales: image classifiers trained on ImageNet, vision-language models (CLIP), and language models (BERT and mT5). We provide code for using TRAK (and reproducing our work) at https://github.com/MadryLab/trak .

Via

Access Paper or Ask Questions

Raising the Cost of Malicious AI-Powered Image Editing

Feb 13, 2023

Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry

Abstract:We present an approach to mitigating the risks of malicious image editing posed by large diffusion models. The key idea is to immunize images so as to make them resistant to manipulation by these models. This immunization relies on injection of imperceptible adversarial perturbations designed to disrupt the operation of the targeted diffusion models, forcing them to generate unrealistic images. We provide two methods for crafting such perturbations, and then demonstrate their efficacy. Finally, we discuss a policy component necessary to make our approach fully effective and practical -- one that involves the organizations developing diffusion models, rather than individual users, to implement (and support) the immunization process.

Via

Access Paper or Ask Questions

Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Jun 19, 2022

Chong Guo, Michael J. Lee, Guillaume Leclerc, Joel Dapello, Yug Rao, Aleksander Madry, James J. DiCarlo

Figure 1 for Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Figure 2 for Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Figure 3 for Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Figure 4 for Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Abstract:Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.

* 10 pages, 6 figures, ICML2022

Via

Access Paper or Ask Questions

Datamodels: Predicting Predictions from Training Data

Feb 01, 2022

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

Figure 1 for Datamodels: Predicting Predictions from Training Data

Figure 2 for Datamodels: Predicting Predictions from Training Data

Figure 3 for Datamodels: Predicting Predictions from Training Data

Figure 4 for Datamodels: Predicting Predictions from Training Data

Abstract:We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset of $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data .

Via

Access Paper or Ask Questions

3DB: A Framework for Debugging Computer Vision Models

Jun 07, 2021

Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang(+2 more)

Figure 1 for 3DB: A Framework for Debugging Computer Vision Models

Figure 2 for 3DB: A Framework for Debugging Computer Vision Models

Figure 3 for 3DB: A Framework for Debugging Computer Vision Models

Figure 4 for 3DB: A Framework for Debugging Computer Vision Models

Abstract:We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation. We demonstrate, through a wide range of use cases, that 3DB allows users to discover vulnerabilities in computer vision systems and gain insights into how models make decisions. 3DB captures and generalizes many robustness analyses from prior work, and enables one to study their interplay. Finally, we find that the insights generated by the system transfer to the physical world. We are releasing 3DB as a library (https://github.com/3db/3db) alongside a set of example analyses, guides, and documentation: https://3db.github.io/3db/ .

Via

Access Paper or Ask Questions

Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy

Feb 26, 2020

Aditya Saligrama, Guillaume Leclerc

Figure 1 for Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy

Figure 2 for Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy

Abstract:A necessary characteristic for the deployment of deep learning models in real world applications is resistance to small adversarial perturbations while maintaining accuracy on non-malicious inputs. While robust training provides models that exhibit better adversarial accuracy than standard models, there is still a significant gap in natural accuracy between robust and non-robust models which we aim to bridge. We consider a number of ensemble methods designed to mitigate this performance difference. Our key insight is that model trained to withstand small attacks, when ensembled, can often withstand significantly larger attacks, and this concept can in turn be leveraged to optimize natural accuracy. We consider two schemes, one that combines predictions from several randomly initialized robust models, and the other that fuses features from robust and standard models.

* 5 pages, accepted to ICLR 2020 Workshop on Towards Trustworthy ML: Rethinking Security and Privacy for ML

Via

Access Paper or Ask Questions

The Two Regimes of Deep Network Training

Feb 24, 2020

Guillaume Leclerc, Aleksander Madry

Figure 1 for The Two Regimes of Deep Network Training

Figure 2 for The Two Regimes of Deep Network Training

Figure 3 for The Two Regimes of Deep Network Training

Figure 4 for The Two Regimes of Deep Network Training

Abstract:Learning rate schedule has a major impact on the performance of deep learning models. Still, the choice of a schedule is often heuristical. We aim to develop a precise understanding of the effects of different learning rate schedules and the appropriate way to select them. To this end, we isolate two distinct phases of training, the first, which we refer to as the "large-step" regime, exhibits a rather poor performance from an optimization point of view but is the primary contributor to model generalization; the latter, "small-step" regime exhibits much more "convex-like" optimization behavior but used in isolation produces models that generalize poorly. We find that by treating these regimes separately-and em specializing our training algorithm to each one of them, we can significantly simplify learning rate schedules.

* 14 pages (5 of appendix), 14 figures

Via

Access Paper or Ask Questions