Aleksander Mądry

Tony

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan (+31 more)

Abstract:AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

AI Supply Chains: An Emerging Ecosystem of AI Actors, Products, and Services

Apr 28, 2025

Aspen Hopkins, Sarah H. Cen, Andrew Ilyas, Isabella Struckman, Luis Videgaray, Aleksander Mądry

Abstract:The widespread adoption of AI in recent years has led to the emergence of AI supply chains: complex networks of AI actors contributing models, datasets, and more to the development of AI products and services. AI supply chains have many implications yet are poorly understood. In this work, we take a first step toward a formal study of AI supply chains and their implications, providing two illustrative case studies indicating that both AI development and regulation are complicated in the presence of supply chains. We begin by presenting a brief historical perspective on AI supply chains, discussing how their rise reflects a longstanding shift towards specialization and outsourcing that signals the healthy growth of the AI industry. We then model AI supply chains as directed graphs and demonstrate the power of this abstraction by connecting examples of AI issues to graph properties. Finally, we examine two case studies in detail, providing theoretical and empirical results in both. In the first, we show that information passing (specifically, of explanations) along the AI supply chains is imperfect, which can result in misunderstandings that have real-world implications. In the second, we show that upstream design choices (e.g., by base model providers) have downstream consequences (e.g., on AI products fine-tuned on the base model). Together, our findings motivate further study of AI supply chains and their increasingly salient social, economic, regulatory, and technical implications.

* 27 pages, 8 figures

GPT-4o System Card

Oct 25, 2024

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda (+409 more)

Abstract:GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Oct 09, 2024

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan (+2 more)

Abstract:We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.

* 10 pages. Plus 17 pages appendix. 8 figures. Equal contribution by first seven authors. Authors randomized. Work by Neil Chowdhury done while at OpenAI

Certified Patch Robustness via Smoothed Vision Transformers

Oct 11, 2021

Hadi Salman, Saachi Jain, Eric Wong, Aleksander Mądry

Abstract:Certified patch defenses can guarantee robustness of an image classifier to arbitrary changes within a bounded contiguous region. But, currently, this robustness comes at a cost of degraded standard accuracies and slower inference times. We demonstrate how using vision transformers enables significantly better certified patch robustness that is also more computationally efficient and does not incur a substantial drop in standard accuracy. These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images. Our code is available at https://github.com/MadryLab/smoothed-vit.

Leveraging Sparse Linear Layers for Debuggable Deep Networks

May 11, 2021

Eric Wong, Shibani Santurkar, Aleksander Mądry

Abstract:We show how fitting sparse linear models over learned deep feature representations can lead to more debuggable neural networks. These networks remain highly accurate while also being more amenable to human interpretation, as we demonstrate quantitatively via numerical and human experiments. We further illustrate how the resulting sparse explanations can help to identify spurious correlations, explain misclassifications, and diagnose model biases in vision and language tasks. The code for our toolkit can be found at https://github.com/madrylab/debuggabledeepnetworks.

A Classification-Based Study of Covariate Shift in GAN Distributions

Jun 06, 2018

Shibani Santurkar, Ludwig Schmidt, Aleksander Mądry

Abstract:A basic, and still largely unanswered, question in the context of Generative Adversarial Networks (GANs) is whether they are truly able to capture all the fundamental characteristics of the distributions they are trained on. In particular, evaluating the diversity of GAN distributions is challenging and existing methods provide only a partial understanding of this issue. In this paper, we develop quantitative and scalable tools for assessing the diversity of GAN distributions. Specifically, we take a classification-based perspective and view loss of diversity as a form of covariate shift introduced by GANs. We examine two specific forms of such shift: mode collapse and boundary distortion. In contrast to prior work, our methods need only minimal human supervision and can be readily applied to state-of-the-art GANs on large, canonical datasets. Examining popular GANs using our tools indicates that these GANs have significant problems in reproducing the more distributional properties of their training dataset.

Adversarially Robust Generalization Requires More Data

May 02, 2018

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, Aleksander Mądry

Abstract:Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.

* Small changes for biblatex compatibility
