Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Melody Y. Guan

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Mar 14, 2025

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi

Abstract:Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.

Via

Access Paper or Ask Questions

Deliberative Alignment: Reasoning Enables Safer Language Models

Dec 20, 2024

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei(+5 more)

Abstract:As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.

* 24 pages

Via

Access Paper or Ask Questions

Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery

Dec 13, 2020

Michael Zhang, Xiaotian Cheng, Daniel Copeland, Arjun Desai, Melody Y. Guan, Gabriel A. Brat, Serena Yeung

Figure 1 for Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery

Figure 2 for Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery

Figure 3 for Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery

Figure 4 for Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery

Abstract:Open, or non-laparoscopic surgery, represents the vast majority of all operating room procedures, but few tools exist to objectively evaluate these techniques at scale. Current efforts involve human expert-based visual assessment. We leverage advances in computer vision to introduce an automated approach to video analysis of surgical execution. A state-of-the-art convolutional neural network architecture for object detection was used to detect operating hands in open surgery videos. Automated assessment was expanded by combining model predictions with a fast object tracker to enable surgeon-specific hand tracking. To train our model, we used publicly available videos of open surgery from YouTube and annotated these with spatial bounding boxes of operating hands. Our model's spatial detections of operating hands significantly outperforms the detections achieved using pre-existing hand-detection datasets, and allow for insights into intra-operative movement patterns and economy of motion.

* AMIA 2020 Annual Symposium

Via

Access Paper or Ask Questions

A Surprising Density of Illusionable Natural Speech

Jun 05, 2019

Melody Y. Guan, Gregory Valiant

Figure 1 for A Surprising Density of Illusionable Natural Speech

Figure 2 for A Surprising Density of Illusionable Natural Speech

Figure 3 for A Surprising Density of Illusionable Natural Speech

Figure 4 for A Surprising Density of Illusionable Natural Speech

Abstract:Recent work on adversarial examples has demonstrated that most natural inputs can be perturbed to fool even state-of-the-art machine learning systems. But does this happen for humans as well? In this work, we investigate: what fraction of natural instances of speech can be turned into "illusions" which either alter humans' perception or result in different people having significantly different perceptions? We first consider the McGurk effect, the phenomenon by which adding a carefully chosen video clip to the audio channel affects the viewer's perception of what is said (McGurk and MacDonald, 1976). We obtain empirical estimates that a significant fraction of both words and sentences occurring in natural speech have some susceptibility to this effect. We also learn models for predicting McGurk illusionability. Finally we demonstrate that the Yanny or Laurel auditory illusion (Pressnitzer et al., 2018) is not an isolated occurrence by generating several very different new instances. We believe that the surprising density of illusionable natural speech warrants further investigation, from the perspectives of both security and cognitive science. Supplementary videos are available at: https://www.youtube.com/playlist?list=PLaX7t1K-e_fF2iaenoKznCatm0RC37B_k.

* 10 pages, 2 figures, 7 tables

Via

Access Paper or Ask Questions

To Trust Or Not To Trust A Classifier

Oct 26, 2018

Heinrich Jiang, Been Kim, Melody Y. Guan, Maya Gupta

Figure 1 for To Trust Or Not To Trust A Classifier

Figure 2 for To Trust Or Not To Trust A Classifier

Figure 3 for To Trust Or Not To Trust A Classifier

Abstract:Knowing when a classifier's prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier's predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier's discriminant or confidence score; however, we show there exists an alternative that is more effective in many situations. We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.

* NIPS 2018

Via

Access Paper or Ask Questions

Efficient Neural Architecture Search via Parameter Sharing

Feb 12, 2018

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean

Figure 1 for Efficient Neural Architecture Search via Parameter Sharing

Figure 2 for Efficient Neural Architecture Search via Parameter Sharing

Figure 3 for Efficient Neural Architecture Search via Parameter Sharing

Figure 4 for Efficient Neural Architecture Search via Parameter Sharing

Abstract:We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performances using much fewer GPU-hours than all existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state-of-the-art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.

Via

Access Paper or Ask Questions

Nonparametric Stochastic Contextual Bandits

Jan 05, 2018

Melody Y. Guan, Heinrich Jiang

Figure 1 for Nonparametric Stochastic Contextual Bandits

Figure 2 for Nonparametric Stochastic Contextual Bandits

Figure 3 for Nonparametric Stochastic Contextual Bandits

Figure 4 for Nonparametric Stochastic Contextual Bandits

Abstract:We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\Big(T^{\frac{1+D}{2+D}}\Big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification.

* AAAI 2018

Via

Access Paper or Ask Questions

Who Said What: Modeling Individual Labelers Improves Classification

Jan 04, 2018

Melody Y. Guan, Varun Gulshan, Andrew M. Dai, Geoffrey E. Hinton

Figure 1 for Who Said What: Modeling Individual Labelers Improves Classification

Figure 2 for Who Said What: Modeling Individual Labelers Improves Classification

Figure 3 for Who Said What: Modeling Individual Labelers Improves Classification

Figure 4 for Who Said What: Modeling Individual Labelers Improves Classification

Abstract:Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona (2010), and by Mnih and Hinton (2012). Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.

* AAAI 2018

Via

Access Paper or Ask Questions

Efficient Attention using a Fixed-Size Memory Representation

Jul 01, 2017

Denny Britz, Melody Y. Guan, Minh-Thang Luong

Figure 1 for Efficient Attention using a Fixed-Size Memory Representation

Figure 2 for Efficient Attention using a Fixed-Size Memory Representation

Figure 3 for Efficient Attention using a Fixed-Size Memory Representation

Figure 4 for Efficient Attention using a Fixed-Size Memory Representation

Abstract:The standard content-based attention mechanism typically used in sequence-to-sequence models is computationally expensive as it requires the comparison of large encoder and decoder states at each time step. In this work, we propose an alternative attention mechanism based on a fixed size memory representation that is more efficient. Our technique predicts a compact set of K attention contexts during encoding and lets the decoder compute an efficient lookup that does not need to consult the memory. We show that our approach performs on-par with the standard attention mechanism while yielding inference speedups of 20% for real-world translation tasks and more for tasks with longer sequences. By visualizing attention scores we demonstrate that our models learn distinct, meaningful alignments.

* EMNLP 2017

Via

Access Paper or Ask Questions