Jianhao Wang

Offline Meta Reinforcement Learning with In-Distribution Online Adaptation

Jun 01, 2023

Jianhao Wang, Jin Zhang, Haozhe Jiang, Junyu Zhang, Liwei Wang, Chongjie Zhang

Abstract: Recent offline meta-reinforcement learning (meta-RL) methods typically utilize task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks. To address this problem, we first formally characterize a unique challenge in offline meta-RL: transition-reward distribution shift between offline datasets and online adaptation. Our theory finds that out-of-distribution adaptation episodes may lead to unreliable policy evaluation and that online adaptation with in-distribution episodes can ensure an adaptation performance guarantee. Based on these theoretical insights, we propose a novel adaptation framework, called In-Distribution online Adaptation with uncertainty Quantification (IDAQ), which generates in-distribution context using a given uncertainty quantification and performs effective task belief inference to address new tasks. We find a return-based uncertainty quantification for IDAQ that performs effectively. Experiments show that IDAQ achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with/without offline adaptation.

Latent-Variable Advantage-Weighted Policy Optimization for Offline RL

Mar 16, 2022

Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, Chongjie Zhang

Abstract: Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets can exacerbate the distribution shift between the behavior policy underlying the data and the optimal policy to be learned, leading to poor performance. To address this challenge, we propose to leverage latent-variable policies that can represent a broader class of policy distributions, leading to better adherence to the training data distribution while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.

Self-Organized Polynomial-Time Coordination Graphs

Dec 07, 2021

Qianlan Yang, Weijun Dong, Zhizhou Ren, Jianhao Wang, Tonghan Wang, Chongjie Zhang

Abstract: A coordination graph is a promising approach to modeling agent collaboration in multi-agent reinforcement learning. It factorizes a large multi-agent system into a suite of overlapping groups that represent the underlying coordination dependencies. One critical challenge in this paradigm is the complexity of computing maximum-value actions for a graph-based value factorization. This computation is a decentralized constraint optimization problem (DCOP); both the problem itself and its constant-ratio approximation are NP-hard. To bypass this fundamental hardness, this paper proposes a novel method, named Self-Organized Polynomial-time Coordination Graphs (SOP-CG), which uses structured graph classes to guarantee the optimality of the induced DCOPs with sufficient function expressiveness. We extend the graph topology to be state-dependent, formulate the graph selection as an imaginary agent, and finally derive an end-to-end learning paradigm from the unified Bellman optimality equation. In experiments, we show that our approach learns interpretable graph topologies, induces effective coordination, and improves performance across a variety of cooperative multi-agent tasks.

Episodic Multi-agent Reinforcement Learning with Curiosity-Driven Exploration

Nov 22, 2021

Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, Chongjie Zhang

Abstract: Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) remains challenging in complex coordination problems. In this paper, we introduce a novel Episodic Multi-agent reinforcement learning with Curiosity-driven exploration, called EMC. We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function capture the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration to new or promising states. We illustrate the advantages of our method with didactic examples, and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.

LINDA: Multi-Agent Local Information Decomposition for Awareness of Teammates

Oct 15, 2021

Jiahan Cao, Lei Yuan, Jianhao Wang, Shaowei Zhang, Chongjie Zhang, Yang Yu, De-Chuan Zhan

Abstract: In cooperative multi-agent reinforcement learning (MARL), where agents only have access to partial observations, efficiently leveraging local information is critical. Over long periods of observation, agents can build awareness of teammates to alleviate the problem of partial observability. However, previous MARL methods usually neglect this kind of utilization of local information. To address this problem, we propose a novel framework, multi-agent Local INformation Decomposition for Awareness of teammates (LINDA), with which agents learn to decompose local information and build awareness of each teammate. We model the awareness as stochastic random variables and perform representation learning to ensure the informativeness of awareness representations by maximizing the mutual information between awareness and the actual trajectory of the corresponding agent. LINDA is agnostic to specific algorithms and can be flexibly integrated into different MARL methods. Extensive experiments show that the proposed framework learns informative awareness from local partial observations for better collaboration and significantly improves the learning performance, especially on challenging tasks.

Offline Reinforcement Learning with Reverse Model-based Imagination

Oct 01, 2021

Jianhao Wang, Wenzhe Li, Haozhe Jiang, Guangxiang Zhu, Siyuan Li, Chongjie Zhang

Abstract: In offline reinforcement learning (offline RL), one of the main challenges is to deal with the distributional shift between the learning policy and the given dataset. To address this problem, recent offline RL methods attempt to introduce conservatism bias to encourage learning in high-confidence areas. Model-free approaches directly encode such bias into policy or value function learning using conservative regularizations or special network structures, but their constrained policy search limits the generalization beyond the offline dataset. Model-based approaches learn forward dynamics models with conservatism quantifications and then generate imaginary trajectories to extend the offline datasets. However, due to limited samples in the offline dataset, conservatism quantifications often suffer from overgeneralization in out-of-support regions. The unreliable conservative measures will mislead forward model-based imaginations to undesired areas, leading to overaggressive behaviors. To encourage more conservatism, we propose a novel model-based offline RL framework, called Reverse Offline Model-based Imagination (ROMI). We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset. These reverse imaginations provide informed data augmentation for the model-free policy learning and enable conservative generalization beyond the offline dataset. ROMI can effectively combine with off-the-shelf model-free algorithms to enable model-based generalization with proper conservatism. Empirical results show that our method can generate more conservative behaviors and achieve state-of-the-art performance on offline RL benchmark tasks.

Efficient Hierarchical Exploration with Stable Subgoal Representation Learning

May 31, 2021

Siyuan Li, Jin Zhang, Jianhao Wang, Chongjie Zhang

Abstract: Goal-conditioned hierarchical reinforcement learning (HRL) serves as a successful approach to solving complex and temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationarity issue of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards, and further demonstrate the stability and efficiency of the subgoal representation learning of this work, which promotes superior policy learning.

QPLEX: Duplex Dueling Multi-Agent Q-Learning

Aug 03, 2020

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, Chongjie Zhang

Abstract: We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). CTDE requires the consistency of the optimal joint action selection with optimal individual action selections, which is called the IGM (Individual-Global-Max) principle. However, in order to achieve scalability, existing MARL methods either limit representation expressiveness of their value function classes or relax the IGM consistency, which may lead to poor policies or even divergence. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), that takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling architecture transforms the IGM principle to easily realized constraints on advantage functions and thus enables efficient value function learning. Theoretical analysis shows that QPLEX solves a rich class of tasks. Empirical experiments on StarCraft II unit micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline task settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional exploration.
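The IGM consistency mentioned in the abstract can be illustrated with a small tabular check. The sketch below is not the QPLEX architecture; it only tests, for made-up value tables, whether the greedy joint action matches the tuple of greedy individual actions. The function name `satisfies_igm` and all numbers are hypothetical, and the check assumes each table has a unique maximizer.

```python
import numpy as np


def satisfies_igm(q_joint, q_individual):
    """Check IGM for one state: greedy joint action == greedy individual actions.

    q_joint: array of shape (n_actions_1, ..., n_actions_k), joint action values.
    q_individual: list of k 1-D arrays, one utility vector per agent.
    Assumes unique maximizers (ties are not handled in this sketch).
    """
    joint_greedy = np.unravel_index(np.argmax(q_joint), q_joint.shape)
    individual_greedy = tuple(int(np.argmax(q)) for q in q_individual)
    return tuple(int(a) for a in joint_greedy) == individual_greedy


# Two agents with three actions each; the values are illustrative only.
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([0.5, 0.2, 4.0])

# Additive (linear) decomposition: Q_tot(a1, a2) = q1[a1] + q2[a2].
q_additive = q1[:, None] + q2[None, :]

# A joint table with a coordination payoff the individual utilities do not reflect.
q_nonfactored = q_additive.copy()
q_nonfactored[0, 0] = 10.0

print(satisfies_igm(q_additive, [q1, q2]))     # additive tables satisfy IGM
print(satisfies_igm(q_nonfactored, [q1, q2]))  # the modified table violates it
```

The additive table satisfies the check by construction, which is why restricted function classes can guarantee IGM at the cost of expressiveness; the second table shows a joint value function that these particular individual utilities cannot represent consistently.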

Towards Understanding Linear Value Decomposition in Cooperative Multi-Agent Q-Learning

Jun 23, 2020

Jianhao Wang, Zhizhou Ren, Beining Han, Chongjie Zhang

Abstract: Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: 1) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and 2) multi-agent Q-learning with linear value decomposition requires on-policy data distribution to achieve numerical stability. In the empirical study, our experiments demonstrate the realizability of our theoretical implications in a broad set of complicated tasks. They show that most state-of-the-art deep multi-agent Q-learning algorithms using linear value decomposition cannot efficiently utilize off-policy samples, which may even lead to an unbounded divergence.

Learn to Effectively Explore in Context-Based Meta-RL

Jun 15, 2020

Jin Zhang, Jianhao Wang, Hao Hu, Yingfeng Chen, Changjie Fan, Chongjie Zhang

Abstract: Meta reinforcement learning (meta-RL) provides a principled approach for fast adaptation to novel tasks by extracting prior knowledge from previous tasks. Under such settings, it is crucial for the agent to perform efficient exploration during adaptation to collect useful experiences. However, existing methods suffer from poor adaptation performance caused by inefficient exploration mechanisms, especially in sparse-reward problems. In this paper, we present a novel off-policy context-based meta-RL approach that efficiently learns a separate exploration policy to support fast adaptation, as well as a context-aware exploitation policy to maximize extrinsic return. The explorer is motivated by an information-theoretical intrinsic reward that encourages the agent to collect experiences that provide rich information about the task. Experimental results on both MuJoCo and Meta-World benchmarks show that our method significantly outperforms baselines by performing efficient exploration strategies.

* Committed to NeurIPS 2020
