Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zichuan Lin

Learning Versatile Skills with Curriculum Masking

Oct 23, 2024

Yao Tang, Zhihui Xie, Zichuan Lin, Deheng Ye, Shuai Li

Abstract:Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.

* NeurIPS 2024 poster, 21 pages, 7 figures

Via

Access Paper or Ask Questions

Replay-enhanced Continual Reinforcement Learning

Nov 20, 2023

Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, Deheng Ye

Figure 1 for Replay-enhanced Continual Reinforcement Learning

Figure 2 for Replay-enhanced Continual Reinforcement Learning

Figure 3 for Replay-enhanced Continual Reinforcement Learning

Figure 4 for Replay-enhanced Continual Reinforcement Learning

Abstract:Replaying past experiences has proven to be a highly effective approach for averting catastrophic forgetting in supervised continual learning. However, some crucial factors are still largely ignored, making it vulnerable to serious failure, when used as a solution to forgetting in continual reinforcement learning, even in the context of perfect memory where all data of previous tasks are accessible in the current task. On the one hand, since most reinforcement learning algorithms are not invariant to the reward scale, the previously well-learned tasks (with high rewards) may appear to be more salient to the current learning process than the current task (with small initial rewards). This causes the agent to concentrate on those salient tasks at the expense of generality on the current task. On the other hand, offline learning on replayed tasks while learning a new task may induce a distributional shift between the dataset and the learned policy on old tasks, resulting in forgetting. In this paper, we introduce RECALL, a replay-enhanced method that greatly improves the plasticity of existing replay-based methods on new tasks while effectively avoiding the recurrence of catastrophic forgetting in continual reinforcement learning. RECALL leverages adaptive normalization on approximate targets and policy distillation on old tasks to enhance generality and stability, respectively. Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves comparable or better overall performance against state-of-the-art continual learning methods.

* Accepted by Transactions on Machine Learning Research 2023

Via

Access Paper or Ask Questions

Future-conditioned Unsupervised Pretraining for Decision Transformer

May 26, 2023

Zhihui Xie, Zichuan Lin, Deheng Ye, Qiang Fu, Wei Yang, Shuai Li

Abstract:Recent research in offline reinforcement learning (RL) has demonstrated that return-conditioned supervised learning is a powerful paradigm for decision-making problems. While promising, return conditioning is limited to training data labeled with rewards and therefore faces challenges in learning from unsupervised data. In this work, we aim to utilize generalized future conditioning to enable efficient unsupervised pretraining from reward-free and sub-optimal offline data. We propose Pretrained Decision Transformer (PDT), a conceptually simple approach for unsupervised RL pretraining. PDT leverages future trajectory information as a privileged context to predict actions during training. The ability to make decisions based on both present and future factors enhances PDT's capability for generalization. Besides, this feature can be easily incorporated into a return-conditioned framework for online finetuning, by assigning return values to possible futures and sampling future embeddings based on their respective values. Empirically, PDT outperforms or performs on par with its supervised pretraining counterpart, especially when dealing with sub-optimal data. Further analysis reveals that PDT can extract diverse behaviors from offline data and controllably sample high-return behaviors by online finetuning. Code is available at here.

* 17 pages, 9 figures, ICML 2023

Via

Access Paper or Ask Questions

Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization

Feb 05, 2023

Zichuan Lin, Xiapeng Wu, Mingfei Sun, Deheng Ye, Qiang Fu, Wei Yang, Wei Liu

Abstract:Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse. In this paper, we show that the use of importance sampling could introduce high variance in the objective estimate. Specifically, we show in a principled way that the variance of importance sampling estimate grows quadratically with importance ratios and the large ratios could consequently jeopardize the effectiveness of surrogate objective optimization. We then propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high. We instantiate this sample dropout technique on representative policy optimization algorithms, including TRPO, PPO, and ESPO, and demonstrate that it consistently boosts the performance of those DRL algorithms on both continuous and discrete action controls, including MuJoCo, DMControl and Atari video games. Our code is open-sourced at \url{https://github.com/LinZichuan/sdpo.git}.

Via

Access Paper or Ask Questions

A Survey on Transformers in Reinforcement Learning

Jan 08, 2023

Wenzhe Li, Hao Luo, Zichuan Lin, Chongjie Zhang, Zongqing Lu, Deheng Ye

Abstract:Transformer has been considered the dominating neural architecture in NLP and CV, mostly under a supervised setting. Recently, a similar surge of using Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. Hence, in this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy on existing works, discuss each sub-field, and summarize future prospects.

Via

Access Paper or Ask Questions

Pretraining in Deep Reinforcement Learning: A Survey

Nov 08, 2022

Zhihui Xie, Zichuan Lin, Junyou Li, Shuai Li, Deheng Ye

Abstract:The past few years have seen rapid progress in combining reinforcement learning (RL) with deep learning. Various breakthroughs ranging from games to robotics have spurred the interest in designing sophisticated RL algorithms and systems. However, the prevailing workflow in RL is to learn tabula rasa, which may incur computational inefficiency. This precludes continuous deployment of RL algorithms and potentially excludes researchers without large-scale computing resources. In many other areas of machine learning, the pretraining paradigm has shown to be effective in acquiring transferable knowledge, which can be utilized for a variety of downstream tasks. Recently, we saw a surge of interest in Pretraining for Deep RL with promising results. However, much of the research has been based on different experimental settings. Due to the nature of RL, pretraining in this field is faced with unique challenges and hence requires new design principles. In this survey, we seek to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.

Via

Access Paper or Ask Questions

Revisiting Discrete Soft Actor-Critic

Sep 22, 2022

Haibin Zhou, Zichuan Lin, Junyou Li, Deheng Ye, Qiang Fu, Wei Yang

Figure 1 for Revisiting Discrete Soft Actor-Critic

Figure 2 for Revisiting Discrete Soft Actor-Critic

Figure 3 for Revisiting Discrete Soft Actor-Critic

Figure 4 for Revisiting Discrete Soft Actor-Critic

Abstract:We study the adaption of soft actor-critic (SAC) from continuous action space to discrete action space. We revisit vanilla SAC and provide an in-depth understanding of its Q value underestimation and performance instability issues when applied to discrete settings. We thereby propose entropy-penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at:https://github.com/coldsummerday/Revisiting-Discrete-SAC.

Via

Access Paper or Ask Questions

MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Feb 17, 2022

Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang(+12 more)

Figure 1 for MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Figure 2 for MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Figure 3 for MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Figure 4 for MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Abstract:Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost -- increased difficulty. If the barrier for entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track progressed the generalizable solutions in the same task.

* Under review for PMLR volume on NeurIPS 2021 competitions

Via

Access Paper or Ask Questions

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Dec 07, 2021

Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang

Figure 1 for JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Figure 2 for JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Figure 3 for JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Figure 4 for JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Abstract:Learning rational behaviors in open-world games like Minecraft remains to be challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy to control over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning which captures underlying relations between action and representation, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

* The champion solution of NeurIPS 2021 MineRL research competition ( https://www.aicrowd.com/challenges/neurips-2021-minerl-diamond-competition/leaderboards )

Via

Access Paper or Ask Questions

Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Jun 09, 2021

Zichuan Lin, Jing Huang, Bowen Zhou, Xiaodong He, Tengyu Ma

Figure 1 for Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Figure 2 for Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Figure 3 for Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Figure 4 for Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Abstract:Recent work (Takanobu et al., 2020) proposed the system-wise evaluation on dialog systems and found that improvement on individual components (e.g., NLU, policy) in prior work may not necessarily bring benefit to pipeline systems in system-wise evaluation. To improve the system-wise performance, in this paper, we propose new joint system-wise optimization techniques for the pipeline dialog system. First, we propose a new data augmentation approach which automates the labeling process for NLU training. Second, we propose a novel stochastic policy parameterization with Poisson distribution that enables better exploration and offers a principled way to compute policy gradient. Third, we propose a reward bonus to help policy explore successful dialogs. Our approaches outperform the competitive pipeline systems from Takanobu et al. (2020) by big margins of 12% success rate in automatic system-wise evaluation and of 16% success rate in human evaluation on the standard multi-domain benchmark dataset MultiWOZ 2.1, and also outperform the recent state-of-the-art end-to-end trained model from DSTC9.

* 13 pages

Via

Access Paper or Ask Questions