Jiayu Yao

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Jun 11, 2025

Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song(+1 more)

Abstract:Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful direction perturbations, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
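The regularization idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation (that lives in the linked repository). It assumes the weights are flattened into vectors, and the names `orthogonal_penalty`, `regularized_loss`, and `lam` are hypothetical. The penalty measures the part of a fine-tuning update that lies off the alignment direction, so updates along that direction cost nothing.

```python
import numpy as np

def orthogonal_penalty(delta_w, align_dir):
    """Squared norm of the component of delta_w orthogonal to align_dir.

    delta_w   : flattened fine-tuning update (assumed representation)
    align_dir : aligned weights minus unaligned weights, flattened
    """
    d = align_dir / np.linalg.norm(align_dir)   # unit alignment direction
    parallel = np.dot(delta_w, d) * d           # part along the anchor
    orth = delta_w - parallel                   # part leaving the anchor
    return float(np.dot(orth, orth))

def regularized_loss(task_loss, delta_w, align_dir, lam=1.0):
    """Task loss plus a weighted penalty on off-direction updates (illustrative)."""
    return task_loss + lam * orthogonal_penalty(delta_w, align_dir)

# Toy example: the alignment direction is the first coordinate axis.
align_dir = np.array([2.0, 0.0, 0.0])
along = np.array([5.0, 0.0, 0.0])    # update along the alignment direction
mixed = np.array([3.0, 4.0, 0.0])    # update with an orthogonal component

print(orthogonal_penalty(along, align_dir))
print(orthogonal_penalty(mixed, align_dir))
```

In this toy setup only the second update is penalized, which mirrors the abstract's claim that movement along the alignment direction is the safe one.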

CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation

Dec 11, 2024

Aishwarya Mandyam, Shengpu Tang, Jiayu Yao, Jenna Wiens, Barbara E. Engelhardt

Abstract:Off-policy evaluation (OPE) provides safety guarantees by estimating the performance of a policy before deployment. Recent work introduced IS+, an importance sampling (IS) estimator that uses expert-annotated counterfactual samples to improve behavior dataset coverage. However, IS estimators are known to have high variance; furthermore, the performance of IS+ deteriorates when annotations are imperfect. In this work, we propose a family of OPE estimators inspired by the doubly robust (DR) principle. A DR estimator combines IS with a reward model estimate, known as the direct method (DM), and offers favorable statistical guarantees. We propose three strategies for incorporating counterfactual annotations into a DR-inspired estimator and analyze their properties under various realistic settings. We prove that using imperfect annotations in the DM part of the estimator best leverages the annotations, as opposed to using them in the IS part. To support our theoretical findings, we evaluate the proposed estimators in three contextual bandit environments. Our empirical results show that when the reward model is misspecified and the annotations are imperfect, it is most beneficial to use the annotations only in the DM portion of a DR estimator. Based on these theoretical and empirical insights, we provide a practical guide for using counterfactual annotations in different realistic settings.
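For readers unfamiliar with the doubly robust principle mentioned above, the following is a minimal sketch of the standard DR estimator for a contextual bandit. It does not include the counterfactual annotations or any of the three strategies proposed in the paper. The function name and array layout are assumptions made for illustration.

```python
import numpy as np

def doubly_robust_value(actions, rewards, pi_b, pi_e, r_hat):
    """Standard doubly robust estimate of a target policy's value.

    actions : (n,)   logged action indices
    rewards : (n,)   logged rewards
    pi_b    : (n, k) behavior policy probabilities per context
    pi_e    : (n, k) evaluation policy probabilities per context
    r_hat   : (n, k) reward model predictions per context and action
    """
    idx = np.arange(len(actions))
    # Direct method (DM): expected model reward under the evaluation policy.
    dm = np.sum(pi_e * r_hat, axis=1)
    # Importance sampling (IS) weight for the logged action.
    rho = pi_e[idx, actions] / pi_b[idx, actions]
    # IS-weighted correction of the reward model's residual.
    correction = rho * (rewards - r_hat[idx, actions])
    return float(np.mean(dm + correction))

# Toy data: two contexts, two actions, uniform behavior policy.
actions = np.array([0, 1])
rewards = np.array([1.0, 3.0])
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_e = np.array([[1.0, 0.0], [0.0, 1.0]])
r_hat = np.array([[1.0, 0.0], [0.0, 2.0]])

print(doubly_robust_value(actions, rewards, pi_b, pi_e, r_hat))  # 2.5
```

The DM term carries the estimate when the reward model is accurate, while the IS-weighted residual corrects it when the model is off.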

Contextual Conservative Q-Learning for Offline Reinforcement Learning

Jan 16, 2023

Ke Jiang, Jiayu Yao, Xiaoyang Tan

Abstract:Offline reinforcement learning learns an effective policy from offline datasets without online interaction, and it attracts persistent research attention because of its potential for practical application. However, extrapolation error caused by distribution shift still leads to overestimation for actions that transition to out-of-distribution (OOD) states, which degrades the reliability and robustness of the offline policy. In this paper, we propose Contextual Conservative Q-Learning (C-CQL) to learn a robustly reliable policy through the contextual information captured via an inverse dynamics model. Under the supervision of the inverse dynamics model, it tends to learn a policy that generates stable transitions at perturbed states, since perturbed states are a common kind of OOD state. In this manner, the learnt policy becomes more likely to generate transitions that lead to the empirical next-state distribution of the offline dataset, i.e., robustly reliable transitions. In addition, we show theoretically that C-CQL generalizes Conservative Q-Learning (CQL) and aggressive State Deviation Correction (SDC). Finally, experimental results demonstrate that the proposed C-CQL achieves state-of-the-art performance in most environments of the offline MuJoCo suite and in a noisy MuJoCo setting.

* the work is not finished

An Empirical Analysis of the Advantages of Finite- v.s. Infinite-Width Bayesian Neural Networks

Nov 28, 2022

Jiayu Yao, Yaniv Yacoby, Beau Coker, Weiwei Pan, Finale Doshi-Velez

Abstract:Comparing Bayesian neural networks (BNNs) with different widths is challenging because, as the width increases, multiple model properties change simultaneously, and inference in the finite-width case is intractable. In this work, we empirically compare finite- and infinite-width BNNs, and provide quantitative and qualitative explanations for their performance difference. We find that when the model is mis-specified, increasing width can hurt BNN performance. In these cases, we provide evidence that finite-width BNNs generalize better partially due to the properties of their frequency spectrum that allow them to adapt under model mismatch.

Deep Semi-supervised Learning with Double-Contrast of Features and Semantics

Nov 28, 2022

Quan Feng, Jiayu Yao, Zhison Pan, Guojun Zhou

Abstract:In recent years, the field of intelligent transportation systems (ITS) has achieved remarkable success, mainly due to the large amount of available annotated data. However, obtaining such annotated data is expensive in practice. Therefore, a more realistic strategy is to leverage semi-supervised learning (SSL) with a small amount of labeled data and a large amount of unlabeled data. Typically, semantic consistency regularization and two-stage learning methods that decouple feature extraction from classification have been proven effective. Nevertheless, representation learning limited to semantic consistency regularization may not guarantee the separation or discriminability of representations of samples with different semantics, and, due to the inherent limitations of two-stage learning methods, the extracted features may not match the specific downstream tasks. To address these drawbacks, this paper proposes an end-to-end deep semi-supervised learning method with a double contrast of semantics and features, which extracts effective task-specific discriminative features by contrasting the semantics and features of positive and negative augmented sample pairs. Moreover, we use information theory to explain the rationale for the double contrast of semantics and features, and relax mutual information to a contrastive loss in a simpler way. Finally, the effectiveness of our method is verified on benchmark datasets.

Success of Uncertainty-Aware Deep Models Depends on Data Manifold Geometry

Aug 05, 2022

Mark Penrod, Harrison Termotto, Varshini Reddy, Jiayu Yao, Finale Doshi-Velez, Weiwei Pan

Abstract:For responsible decision making in safety-critical settings, machine learning models must effectively detect and process edge-case data. Although existing works show that predictive uncertainty is useful for these tasks, it is not evident from the literature which uncertainty-aware models are best suited for a given dataset. Thus, we compare six uncertainty-aware deep learning models on a set of edge-case tasks: robustness to adversarial attacks as well as out-of-distribution and adversarial detection. We find that the geometry of the data sub-manifold is an important factor in determining the success of various models. Our finding suggests an interesting direction in the study of uncertainty-aware deep learning models.

* International Conference on Machine Learning. PMLR 162 (2022)

Policy Optimization with Sparse Global Contrastive Explanations

Jul 13, 2022

Jiayu Yao, Sonali Parbhoo, Weiwei Pan, Finale Doshi-Velez

Abstract:We develop a Reinforcement Learning (RL) framework for improving an existing behavior policy via sparse, user-interpretable changes. Our goal is to make minimal changes while gaining as much benefit as possible. We define a minimal change as having a sparse, global contrastive explanation between the original and proposed policy. We improve the current policy with the constraint of keeping that global contrastive explanation short. We demonstrate our framework with a discrete MDP and a continuous 2D navigation domain.

* Accepted at IMLH Workshop, ICML 2022

Learning Downstream Task by Selectively Capturing Complementary Knowledge from Multiple Self-supervisedly Learning Pretexts

Apr 11, 2022

Quan Feng, Qingyuan Wu, Jiayu Yao, Songcan Chen

Abstract:Self-supervised learning (SSL), as a newly emerging unsupervised representation learning paradigm, generally follows a two-stage learning pipeline: 1) learning invariant and discriminative representations with auto-annotation pretext(s), then 2) transferring the representations to assist downstream task(s). These two stages are usually implemented separately, making the learned representations agnostic to the downstream tasks. Currently, most works are devoted to exploring the first stage. In contrast, how to learn downstream tasks with limited labeled data using the already learned representations is less studied. In particular, it is crucial and challenging to selectively utilize the complementary representations from diverse pretexts for a downstream task. In this paper, we propose a novel solution that leverages the attention mechanism to adaptively squeeze suitable representations for the tasks. Meanwhile, resorting to information theory, we theoretically prove that gathering representations from diverse pretexts is more effective than using a single one. Extensive experiments validate that our scheme significantly exceeds current popular pretext-matching based methods in gathering knowledge and relieving negative transfer in downstream tasks.

Power-Constrained Bandits

Apr 13, 2020

Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, Finale Doshi-Velez

Abstract:Contextual bandits often provide simple and effective personalization in decision making problems, making them popular in many domains including digital health. However, when bandits are deployed in the context of a scientific study, the aim is not only to personalize for an individual, but also to determine, with sufficient statistical power, whether or not the system's intervention is effective. In this work, we develop a set of constraints and a general meta-algorithm that can be used to both guarantee power constraints and minimize regret. Our results demonstrate that a number of existing algorithms can be easily modified to satisfy the constraint without a significant decrease in average return. We also show that our modification is robust to a variety of model mis-specifications.
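The abstract does not spell out the constraints, so the sketch below is only an illustration of one plausible form such a wrapper could take: bounding the probability of a binary intervention away from 0 and 1 so that both arms keep being sampled. The clipping rule, the bounds `pi_min` and `pi_max`, and both function names are assumptions, not details taken from the paper.

```python
import numpy as np

def clip_action_probability(p_treat, pi_min=0.2, pi_max=0.8):
    """Bound the treatment probability proposed by any base bandit (assumed rule)."""
    return float(np.clip(p_treat, pi_min, pi_max))

def select_action(p_treat, rng, pi_min=0.2, pi_max=0.8):
    """Sample a binary action from the clipped probability.

    Returns the action (1 = intervene, 0 = do nothing) and the probability
    actually used, which a later analysis would need for weighting.
    """
    p = clip_action_probability(p_treat, pi_min, pi_max)
    action = int(rng.random() < p)
    return action, p

rng = np.random.default_rng(0)
# A base algorithm that is almost certain about the best arm still explores.
action, used_p = select_action(0.99, rng)
print(used_p)  # 0.8
```

Because the wrapper only post-processes a probability, it can sit on top of different base algorithms, which is in the spirit of the meta-algorithm the abstract describes.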

Quality of Uncertainty Quantification for Bayesian Neural Network Inference

Jun 24, 2019

Jiayu Yao, Weiwei Pan, Soumya Ghosh, Finale Doshi-Velez

Abstract:Bayesian Neural Networks (BNNs) place priors over the parameters in a neural network. Inference in BNNs, however, is difficult; all inference methods for BNNs are approximate. In this work, we empirically compare the quality of predictive uncertainty estimates for 10 common inference methods on both regression and classification tasks. Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.

* Accepted to ICML UDL 2019
