Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bo Ding

Pay More Attention to the Robustness of Prompt for Instruction Data Mining

Mar 31, 2025

Qiang Wang, Dawei Feng, Xu Zhang, Ao Shen, Yang Xu, Bo Ding, Huaimin Wang

Abstract:Instruction tuning has emerged as a paramount method for tailoring the behaviors of LLMs. Recent work has unveiled the potential for LLMs to achieve high performance through fine-tuning with a limited quantity of high-quality instruction data. Building upon this approach, we further explore the impact of prompt's robustness on the selection of high-quality instruction data. This paper proposes a pioneering framework of high-quality online instruction data mining for instruction tuning, focusing on the impact of prompt's robustness on the data mining process. Our notable innovation, is to generate the adversarial instruction data by conducting the attack for the prompt of online instruction data. Then, we introduce an Adversarial Instruction-Following Difficulty metric to measure how much help the adversarial instruction data can provide to the generation of the corresponding response. Apart from it, we propose a novel Adversarial Instruction Output Embedding Consistency approach to select high-quality online instruction data. We conduct extensive experiments on two benchmark datasets to assess the performance. The experimental results serve to underscore the effectiveness of our proposed two methods. Moreover, the results underscore the critical practical significance of considering prompt's robustness.

Via

Access Paper or Ask Questions

Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Sep 14, 2024

Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang

Figure 1 for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Figure 2 for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Figure 3 for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Figure 4 for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Abstract:Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Policy Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.

Via

Access Paper or Ask Questions

Online Self-Preferring Language Models

May 23, 2024

Yuanzhao Zhai, Zhuo Zhang, Kele Xu, Hanyang Peng, Yue Yu, Dawei Feng, Cheng Yang, Bo Ding, Huaimin Wang

Figure 1 for Online Self-Preferring Language Models

Figure 2 for Online Self-Preferring Language Models

Figure 3 for Online Self-Preferring Language Models

Figure 4 for Online Self-Preferring Language Models

Abstract:Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampling responses. Recently, offline methods that directly fit responses with binary preferences in the dataset have emerged as alternatives. However, existing methods do not explicitly model preference strength information, which is crucial for distinguishing different response pairs. To overcome this limitation, we propose Online Self-Preferring (OSP) language models to learn from self-generated response pairs and self-judged preference strengths. For each prompt and corresponding self-generated responses, we introduce a ranked pairing method to construct multiple response pairs with preference strength information. We then propose the soft-preference cross-entropy loss to leverage such information. Empirically, we demonstrate that leveraging preference strength is crucial for avoiding overfitting and enhancing alignment performance. OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets. OSP is parameter-efficient and more robust than the dominant online method, RLHF when limited offline data are available and generalizing to out-of-domain tasks. Moreover, OSP language models established by LLMs with proficiency in self-preferring can efficiently self-improve without external supervision.

* 20 pages, 9 figures

Via

Access Paper or Ask Questions

Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles

Dec 30, 2023

Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang

Abstract:Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective in the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning. To enhance the uncertainty quantification abilities for reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations. Then we optimize policy models utilizing penalized rewards, determined by both rewards and uncertainties provided by the diverse reward LoRA ensembles. Our experimental results, based on two real human preference datasets, showcase the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. Additionally, uncertainty regularization in UP-RLHF proves to be pivotal in mitigating overoptimization, thereby contributing to the overall performance.

* 10 pages, 5 figures,

Via

Access Paper or Ask Questions

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

Dec 05, 2023

Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia

Abstract:Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of the general model to represent personalized details and the generalization problem to unseen controllable parameters. In this work, we propose Myportrait, a simple, general, and flexible framework for neural portrait generation. We incorporate personalized prior in a monocular video and morphable prior in 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Distinguished by whether the test data is sent to training or not, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments in various metrics demonstrate the superior performance of our method over the state-of-the-art methods. The code will be publicly available.

Via

Access Paper or Ask Questions

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

May 30, 2023

Junpeng Wang, Mengke Ge, Bo Ding, Qi Xu, Song Chen, Yi Kang

Figure 1 for NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Figure 2 for NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Figure 3 for NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Figure 4 for NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Abstract:With the widespread use of deep neural networks(DNNs) in intelligent systems, DNN accelerators with high performance and energy efficiency are greatly demanded. As one of the feasible processing-in-memory(PIM) architectures, 3D-stacked-DRAM-based PIM(DRAM-PIM) architecture enables large-capacity memory and low-cost memory access, which is a promising solution for DNN accelerators with better performance and energy efficiency. However, the low-cost characteristics of stacked DRAM and the distributed manner of memory access and data storing require us to rebalance the hardware design and DNN mapping. In this paper, we propose NicePIM to efficiently explore the design space of hardware architecture and DNN mapping of DRAM-PIM accelerators, which consists of three key components: PIM-Tuner, PIM-Mapper and Data-Scheduler. PIM-Tuner optimizes the hardware configurations leveraging a DNN model for classifying area-compliant architectures and a deep kernel learning model for identifying better hardware parameters. PIM-Mapper explores a variety of DNN mapping configurations, including parallelism between branches of DNN, DNN layer partitioning, DRAM capacity allocation and data layout pattern in DRAM to generate high-hardware-utilization DNN mapping schemes for various hardware configurations. The Data-Scheduler employs an integer-linear-programming-based data scheduling algorithm to alleviate the inter-PIM-node communication overhead of data-sharing brought by DNN layer partitioning. Experimental results demonstrate that NicePIM can optimize hardware configurations for DRAM-PIM systems effectively and can generate high-quality DNN mapping schemes with latency and energy cost reduced by 37% and 28% on average respectively compared to the baseline method.

* 13 pages

Via

Access Paper or Ask Questions

Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Aug 24, 2022

Zijian Gao, Kele Xu, HengXing Cai, Yuanzhao Zhai, Dawei Feng, Bo Ding, XinJun Mao, Huaimin Wang

Figure 1 for Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Figure 2 for Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Figure 3 for Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Figure 4 for Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Abstract:In real-world scenarios, reinforcement learning under sparse-reward synergistic settings has remained challenging, despite surging interests in this field. Previous attempts suggest that intrinsic reward can alleviate the issue caused by sparsity. In this paper, we present a novel intrinsic reward that is inspired by human learning, as humans evaluate curiosity by comparing current observations with historical knowledge. Specifically, we train a self-supervised prediction model and save a set of snapshots of the model parameters, without incurring addition training cost. Then we employ nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots, which can be further deployed as the intrinsic reward. Moreover, a variational weighting mechanism is proposed to assign weight to different snapshots in an adaptive manner. We demonstrate the efficacy of the proposed method in various benchmark environments. The results suggest that our method can provide overwhelming state-of-the-art performance compared with other intrinsic reward-based methods, without incurring additional training costs and maintaining higher noise tolerance. Our code will be released publicly to enhance reproducibility.

Via

Access Paper or Ask Questions

Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Aug 24, 2022

Zijian Gao, Kele Xu, YiYing Li, Yuanzhao Zhai, Dawei Feng, Bo Ding, XinJun Mao, Huaimin Wang

Figure 1 for Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Figure 2 for Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Figure 3 for Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Figure 4 for Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Abstract:The sparsity of extrinsic rewards poses a serious challenge for reinforcement learning (RL). Currently, many efforts have been made on curiosity which can provide a representative intrinsic reward for effective exploration. However, the challenge is still far from being solved. In this paper, we present a novel curiosity for RL, named DyMeCu, which stands for Dynamic Memory-based Curiosity. Inspired by human curiosity and information theory, DyMeCu consists of a dynamic memory and dual online learners. The curiosity arouses if memorized information can not deal with the current state, and the information gap between dual learners can be formulated as the intrinsic reward for agents, and then such state information can be consolidated into the dynamic memory. Compared with previous curiosity methods, DyMeCu can better mimic human curiosity with dynamic memory, and the memory module can be dynamically grown based on a bootstrap paradigm with dual learners. On multiple benchmarks including DeepMind Control Suite and Atari Suite, large-scale empirical experiments are conducted and the results demonstrate that DyMeCu outperforms competitive curiosity-based methods with or without extrinsic rewards. We will release the code to enhance reproducibility.

Via

Access Paper or Ask Questions

Nuclear Norm Maximization Based Curiosity-Driven Learning

May 28, 2022

Chao Chen, Zijian Gao, Kele Xu, Sen Yang, Yiying Li, Bo Ding, Dawei Feng, Huaimin Wang

Figure 1 for Nuclear Norm Maximization Based Curiosity-Driven Learning

Figure 2 for Nuclear Norm Maximization Based Curiosity-Driven Learning

Figure 3 for Nuclear Norm Maximization Based Curiosity-Driven Learning

Figure 4 for Nuclear Norm Maximization Based Curiosity-Driven Learning

Abstract:To handle the sparsity of the extrinsic rewards in reinforcement learning, researchers have proposed intrinsic reward which enables the agent to learn the skills that might come in handy for pursuing the rewards in the future, such as encouraging the agent to visit novel states. However, the intrinsic reward can be noisy due to the undesirable environment's stochasticity and directly applying the noisy value predictions to supervise the policy is detrimental to improve the learning performance and efficiency. Moreover, many previous studies employ $\ell^2$ norm or variance to measure the exploration novelty, which will amplify the noise due to the square operation. In this paper, we address aforementioned challenges by proposing a novel curiosity leveraging the nuclear norm maximization (NNM), which can quantify the novelty of exploring the environment more accurately while providing high-tolerance to the noise and outliers. We conduct extensive experiments across a variety of benchmark environments and the results suggest that NNM can provide state-of-the-art performance compared with previous curiosity methods. On 26 Atari games subset, when trained with only intrinsic reward, NNM achieves a human-normalized score of 1.09, which doubles that of competitive intrinsic rewards-based approaches. Our code will be released publicly to enhance the reproducibility.

Via

Access Paper or Ask Questions

KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

May 25, 2021

Zijian Gao, Kele Xu, Bo Ding, Huaimin Wang, Yiying Li, Hongda Jia

Figure 1 for KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

Figure 2 for KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

Figure 3 for KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

Figure 4 for KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

Abstract:Recently, deep reinforcement learning (RL) algorithms have made great progress in multi-agent domain. However, due to characteristics of RL, training for complex tasks would be resource-intensive and time-consuming. To meet this challenge, mutual learning strategy between homogeneous agents is essential, which is under-explored in previous studies, because most existing methods do not consider to use the knowledge of agent models. In this paper, we present an adaptation method of the majority of multi-agent reinforcement learning (MARL) algorithms called KnowSR which takes advantage of the differences in learning between agents. We employ the idea of knowledge distillation (KD) to share knowledge among agents to shorten the training phase. To empirically demonstrate the robustness and effectiveness of KnowSR, we performed extensive experiments on state-of-the-art MARL algorithms in collaborative and competitive scenarios. The results demonstrate that KnowSR outperforms recently reported methodologies, emphasizing the importance of the proposed knowledge sharing for MARL.

Via

Access Paper or Ask Questions