Alexander Wei

OpenAI o1 System Card

Dec 21, 2024

OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry (+253 more)

Abstract:The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024

Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

Abstract:Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

* 22 pages

Jailbroken: How Does LLM Safety Training Fail?

Jul 05, 2023

Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Abstract:Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.

TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels

Jul 13, 2022

Yaodong Yu, Alexander Wei, Sai Praneeth Karimireddy, Yi Ma, Michael I. Jordan

Abstract:State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.
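The convexification step in the abstract above can be read as a first-order Taylor expansion of the network around the parameters reached in the first training stage. The block below is a sketch under that reading; the symbols $f$, $\theta_0$, and $\ell$ are our notation and are not taken from the abstract.

```latex
% Linearized model around the stage-one parameters \theta_0:
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \big\langle \nabla_\theta f(x;\theta_0),\, \theta - \theta_0 \big\rangle

% f_lin is affine in \theta, so for any loss \ell that is convex in the
% model output, the second-stage objective over clients' data is convex:
\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{\mathrm{lin}}(x_i;\theta),\, y_i\big)
```

The gradient $\nabla_\theta f(x;\theta_0)$ plays the role of a fixed feature map, and its inner products define the empirical neural tangent kernel at $\theta_0$.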

More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize

Mar 11, 2022

Alexander Wei, Wei Hu, Jacob Steinhardt

Abstract:Of theories for why large-scale machine learning models generalize despite being vastly overparameterized, which of their assumptions are needed to capture the qualitative phenomena of generalization in the real world? On one hand, we find that most theoretical analyses fall short of capturing these qualitative phenomena even for kernel regression, when applied to kernels derived from large-scale neural networks (e.g., ResNet-50) and real data (e.g., CIFAR-100). On the other hand, we find that the classical GCV estimator (Craven and Wahba, 1978) accurately predicts generalization risk even in such overparameterized settings. To bolster this empirical finding, we prove that the GCV estimator converges to the generalization risk whenever a local random matrix law holds. Finally, we apply this random matrix theory lens to explain why pretrained representations generalize better as well as what factors govern scaling laws for kernel regression. Our findings suggest that random matrix theory, rather than just being a toy model, may be central to understanding the properties of neural representations in practice.
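For readers unfamiliar with the GCV estimator named in the abstract above, its standard form for a linear smoother is written out below. The notation ($n$ samples, label vector $y$, smoother matrix $S_\lambda$, kernel matrix $K$) is ours, and the scaling of the ridge parameter $\lambda$ is an assumption; conventions differ between references.

```latex
% Fitted values of a linear smoother: \hat{y} = S_\lambda y.
% Generalized cross-validation estimate of the prediction risk:
\mathrm{GCV}(\lambda) =
  \frac{\tfrac{1}{n}\,\lVert (I - S_\lambda)\, y \rVert_2^2}
       {\big(\tfrac{1}{n}\,\operatorname{tr}(I - S_\lambda)\big)^2}

% For kernel ridge regression with kernel matrix K (assumed scaling):
S_\lambda = K\,(K + n\lambda I)^{-1}
```

The estimate uses only the training residuals and the trace of the smoother, so it needs no held-out data.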

Predicting Out-of-Distribution Error with the Projection Norm

Feb 11, 2022

Yaodong Yu, Zitong Yang, Alexander Wei, Yi Ma, Jacob Steinhardt

Abstract:We propose a metric -- Projection Norm -- to predict a model's performance on out-of-distribution (OOD) data without access to ground truth labels. Projection Norm first uses model predictions to pseudo-label test samples and then trains a new model on the pseudo-labels. The more the new model's parameters differ from an in-distribution model, the greater the predicted OOD error. Empirically, our approach outperforms existing methods on both image and text classification tasks and across different network architectures. Theoretically, we connect our approach to a bound on the test error for overparameterized linear models. Furthermore, we find that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples. Our code is available at https://github.com/yaodongyu/ProjNorm.
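The procedure described in the abstract above (pseudo-label the test samples, train a new model on them, compare parameters) can be illustrated with a toy sketch. This is not the authors' implementation, which is available at the repository linked above. It assumes a small logistic-regression model in place of a neural network, a fixed zero initialization, and hypothetical names and hyperparameters (train_logistic, predict, projection_norm, steps, lr).

```python
import math


def train_logistic(xs, ys, steps=200, lr=0.1):
    """Full-batch gradient descent on the logistic loss from a fixed zero init."""
    dim = len(xs[0])
    theta = [0.0] * (dim + 1)  # weights followed by a bias term
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * (dim + 1)
        for x, y in zip(xs, ys):
            z = sum(w * v for w, v in zip(theta, x)) + theta[-1]
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad[j] += err * x[j] / n
            grad[-1] += err / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def predict(theta, x):
    """Hard 0/1 prediction of the logistic model."""
    z = sum(w * v for w, v in zip(theta, x)) + theta[-1]
    return 1 if z > 0 else 0


def projection_norm(train_x, train_y, test_x):
    """Distance between a reference model and a model trained on pseudo-labels."""
    theta_ref = train_logistic(train_x, train_y)         # in-distribution model
    pseudo = [predict(theta_ref, x) for x in test_x]     # pseudo-label test inputs
    theta_new = train_logistic(test_x, pseudo)           # retrain on pseudo-labels
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_new, theta_ref)))


# Toy data (hypothetical): two well-separated classes, plus a shifted test set.
TRAIN_X = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.0, 1.0),
           (-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (-1.0, -1.0)]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]
SHIFTED_X = [(3.0, -0.5), (0.5, 3.0), (-3.0, 0.5), (-0.5, -3.0)]

if __name__ == "__main__":
    print("same inputs:   ", projection_norm(TRAIN_X, TRAIN_Y, TRAIN_X))
    print("shifted inputs:", projection_norm(TRAIN_X, TRAIN_Y, SHIFTED_X))
```

In this toy setup, unlabeled inputs identical to the training inputs reproduce the reference model exactly, while shifted inputs move the retrained parameters away from it. No labels for the test inputs are used at any point.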

Learning Equilibria in Matching Markets from Bandit Feedback

Aug 19, 2021

Meena Jagadeesan, Alexander Wei, Yixin Wang, Michael I. Jordan, Jacob Steinhardt

Abstract:Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. However, since preferences are inherently uncertain during learning, the classical notion of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) is unattainable in these settings. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.

Optimal Robustness-Consistency Trade-offs for Learning-Augmented Online Algorithms

Oct 22, 2020

Alexander Wei, Fred Zhang

Abstract:We study the problem of improving the performance of online algorithms by incorporating machine-learned predictions. The goal is to design algorithms that are both consistent and robust, meaning that the algorithm performs well when predictions are accurate and maintains worst-case guarantees. Such algorithms have been studied in a recent line of work due to Lykouris and Vassilvitskii (ICML '18) and Purohit et al. (NeurIPS '18). They provide robustness-consistency trade-offs for a variety of online problems. However, they leave open the question of whether these trade-offs are tight, i.e., to what extent such trade-offs are necessary. In this paper, we provide the first set of non-trivial lower bounds for competitive analysis using machine-learned predictions. We focus on the classic problems of ski-rental and non-clairvoyant scheduling and provide optimal trade-offs in various settings.

* To appear at NeurIPS 2020
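To make the robustness-consistency trade-off in the abstract above concrete, the sketch below implements the deterministic prediction-guided ski-rental strategy from the earlier work of Purohit et al., as we understand it. It is not taken from this paper, which studies lower bounds for such strategies. Renting costs 1 per day and buying costs buy_cost. The function names and the trust parameter lam in (0, 1] are illustrative choices.

```python
import math


def ski_rental_cost(buy_cost, prediction, true_days, lam):
    """Total cost paid by the prediction-guided strategy.

    prediction: predicted number of ski days.
    true_days:  actual number of ski days (unknown to the strategy in advance).
    lam:        trust parameter in (0, 1]; smaller values trust the prediction more.
    """
    if not 0 < lam <= 1:
        raise ValueError("lam must lie in (0, 1]")
    if prediction >= buy_cost:
        # Prediction says buying pays off: buy early.
        buy_day = math.ceil(lam * buy_cost)
    else:
        # Prediction says renting suffices: delay buying.
        buy_day = math.ceil(buy_cost / lam)
    if true_days >= buy_day:
        # Rented for buy_day - 1 days, then bought at the start of buy_day.
        return (buy_day - 1) + buy_cost
    return true_days  # the season ended before the purchase day


def optimal_cost(buy_cost, true_days):
    """Offline optimum: rent throughout or buy on the first day."""
    return min(buy_cost, true_days)


if __name__ == "__main__":
    b, lam = 10, 0.5
    for y, x in [(20, 20), (20, 5), (2, 2), (2, 100)]:
        ratio = ski_rental_cost(b, y, x, lam) / optimal_cost(b, x)
        print(f"prediction={y:3d} true={x:3d} ratio={ratio:.2f}")
```

With an accurate prediction the cost ratio stays within 1 + lam (consistency), and with an arbitrarily wrong prediction it stays within 1 + 1/lam (robustness). Lowering lam improves the first bound at the expense of the second.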
