Kai Xiao

Trading Inference-Time Compute for Adversarial Robustness

Jan 31, 2025

Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke (+1 more)

Abstract: We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

Dec 24, 2024

Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng

Abstract: Automated red teaming can discover rare model failures and generate challenging examples that can be used for training or evaluation. However, a core challenge in automated red teaming is ensuring that the attacks are both diverse and effective. Prior methods typically succeed in optimizing either for diversity or for effectiveness, but rarely both. In this paper, we provide methods that enable automated red teaming to generate a large number of diverse and successful attacks. Our approach decomposes the task into two steps: (1) automated methods for generating diverse attack goals and (2) generating effective attacks for those goals. While we provide multiple straightforward methods for generating diverse goals, our key contribution is training an RL attacker that both follows those goals and generates diverse attacks for those goals. First, we demonstrate that it is easy to use a large language model (LLM) to generate diverse attacker goals with per-goal prompts and rewards, including rule-based rewards (RBRs) to grade whether the attacks are successful for the particular goal. Second, we demonstrate how training the attacker model with multi-step RL, where the model is rewarded for generating attacks that are different from past attempts, further increases diversity while remaining effective. We use our approach to generate both prompt injection attacks and prompts that elicit unsafe responses. In both cases, we find that our approach is able to generate highly effective and considerably more diverse attacks than past general red-teaming approaches.

OpenAI o1 System Card

Dec 21, 2024

OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry (+253 more)

Abstract: The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Apr 19, 2024

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel

Abstract: Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often treat system prompts (e.g., text from an application developer) as having the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction-following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

Quantifying and Defending against Privacy Threats on Federated Knowledge Graph Embedding

Apr 06, 2023

Yuke Hu, Wei Liang, Ruofan Wu, Kai Xiao, Weiqiang Wang, Xiaochen Li, Jinfei Liu, Zhan Qin

Abstract: Knowledge Graph Embedding (KGE) is a fundamental technique that extracts expressive representations from a knowledge graph (KG) to facilitate diverse downstream tasks. The emerging federated KGE (FKGE) trains collaboratively from distributed KGs held among clients while avoiding the exchange of clients' sensitive raw KGs, yet it can still suffer from privacy threats, as evidenced in other federated model training (e.g., neural networks). However, quantifying and defending against such privacy threats remain unexplored for FKGE, which possesses unique properties not shared by previously studied models. In this paper, we conduct the first holistic study of the privacy threat on FKGE from both attack and defense perspectives. For the attack, we quantify the privacy threat by proposing three new inference attacks, which reveal substantial privacy risk by successfully inferring the existence of KG triples from victim clients. For the defense, we propose DP-Flames, a novel differentially private FKGE with private selection, which offers a better privacy-utility tradeoff by exploiting the entity-binding sparse gradient property of FKGE and comes with a tight privacy accountant by incorporating the state-of-the-art private selection technique. We further propose an adaptive privacy budget allocation policy to dynamically adjust defense magnitude across the training procedure. Comprehensive evaluations demonstrate that the proposed defense can successfully mitigate the privacy threat by effectively reducing the success rate of inference attacks from 83.1% to 59.4% on average with only a modest utility decrease.

* Accepted in the ACM Web Conference (WWW 2023)

On Distinctive Properties of Universal Perturbations

Dec 31, 2021

Sung Min Park, Kuo-An Wei, Kai Xiao, Jerry Li, Aleksander Madry

Abstract: We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties: semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less signal for generalization than standard adversarial perturbations -- that is, UAPs leverage non-robust features to a smaller extent than standard adversarial perturbations.

SHORING: Design Provable Conditional High-Order Interaction Network via Symbolic Testing

Jul 03, 2021

Hui Li, Xing Fu, Ruofan Wu, Jinyu Xu, Kai Xiao, Xiaofu Chang, Weiqiang Wang, Shuai Chen, Leilei Shi, Tao Xiong (+1 more)

Abstract: Deep learning provides a promising way to extract effective representations from raw data in an end-to-end fashion and has proven its effectiveness in various domains such as computer vision and natural language processing. However, in domains such as content/product recommendation and risk management, where sequences of event data are the most common raw data form and expert-derived features are more commonly used, deep learning models struggle to dominate. In this paper, we propose a symbolic testing framework that helps to answer the question of what kinds of expert-derived features could be learned by a neural network. Inspired by this testing framework, we introduce an efficient architecture named SHORING, which contains two components: an event network and a sequence network. The event network efficiently learns arbitrarily high-order event-level embeddings via a provable reparameterization trick, while the sequence network aggregates over the sequence of event-level embeddings. We argue that SHORING is capable of learning certain standard symbolic expressions which the standard multi-head self-attention network fails to learn, and conduct comprehensive experiments and ablation studies on four synthetic datasets and three real-world datasets. The results show that SHORING empirically outperforms the state-of-the-art methods.

* 18 pages, 4 figures

3DB: A Framework for Debugging Computer Vision Models

Jun 07, 2021

Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang (+2 more)

Abstract: We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation. We demonstrate, through a wide range of use cases, that 3DB allows users to discover vulnerabilities in computer vision systems and gain insights into how models make decisions. 3DB captures and generalizes many robustness analyses from prior work, and enables one to study their interplay. Finally, we find that the insights generated by the system transfer to the physical world. We are releasing 3DB as a library (https://github.com/3db/3db) alongside a set of example analyses, guides, and documentation: https://3db.github.io/3db/

Noise or Signal: The Role of Image Backgrounds in Object Recognition

Jun 17, 2020

Kai Xiao, Logan Engstrom, Andrew Ilyas, Aleksander Madry

Abstract: We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds -- up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out-of-distribution performance.

Evaluating Robustness of Neural Networks with Mixed Integer Programming

Jun 11, 2018

Vincent Tjeng, Kai Xiao, Russ Tedrake

Abstract: Neural networks have demonstrated considerable success on a wide variety of real-world problems. However, networks trained only to optimize for training accuracy can often be fooled by adversarial examples - slightly perturbed inputs that are misclassified with high confidence. Verification of networks enables us to gauge their vulnerability to such adversarial examples. We formulate verification of piecewise-linear neural networks as a mixed integer program. On a representative task of finding minimum adversarial distortions, our verifier is two to three orders of magnitude quicker than the state-of-the-art. We achieve this computational speedup via tight formulations for non-linearities, as well as a novel presolve algorithm that makes full use of all information available. The computational speedup allows us to verify properties on convolutional networks with an order of magnitude more ReLUs than networks previously verified by any complete verifier. In particular, we determine for the first time the exact adversarial accuracy of an MNIST classifier to perturbations with bounded $l_\infty$ norm $\epsilon=0.1$: for this classifier, we find an adversarial example for 4.38% of samples, and a certificate of robustness (to perturbations with bounded norm) for the remainder. Across all robust training procedures and network architectures considered, we are able to certify more samples than the state-of-the-art and find more adversarial examples than a strong first-order attack.
