Abstract: While machine learning methods excel at pattern recognition, they struggle to perform complex reasoning tasks in a scalable, algorithmic manner. Recent Deep Thinking methods show promise in learning algorithms that extrapolate: the algorithm is learned in smaller environments and executed in larger ones. However, these works are limited to symmetrical tasks, where the input and output dimensionalities are the same. To address this gap, we propose NeuralThink, a new recurrent architecture that consistently extrapolates to both symmetrical and asymmetrical tasks, where the input and output dimensionalities differ. We also contribute a novel benchmark of asymmetrical tasks for extrapolation. We show that NeuralThink consistently outperforms prior state-of-the-art Deep Thinking architectures in terms of stable extrapolation to large observations from smaller training sizes.
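To illustrate the extrapolation principle shared by deep-thinking-style models, the sketch below shows a weight-tied recurrent block applied for more iterations on larger inputs at test time. The module names, layer sizes, and the purely symmetric (image-to-image) setting are illustrative assumptions, not the actual NeuralThink architecture.

```python
import torch
import torch.nn as nn

class RecurrentReasoner(nn.Module):
    """Illustrative deep-thinking-style model: a small weight-tied
    convolutional block applied for a variable number of iterations."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, 3, padding=1)
        self.recur = nn.Sequential(  # same weights reused at every step
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.decode = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x: torch.Tensor, iterations: int) -> torch.Tensor:
        h = self.encode(x)
        for _ in range(iterations):  # larger inputs get more iterations
            h = self.recur(h)
        return self.decode(h)

# Train on small inputs with few iterations, test on larger inputs with more.
model = RecurrentReasoner()
out_small = model(torch.randn(8, 1, 16, 16), iterations=10)
out_large = model(torch.randn(8, 1, 64, 64), iterations=40)  # extrapolation
```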
Abstract: This paper introduces a formal definition of the setting of ad hoc teamwork under partial observability and proposes a model-based approach grounded in first principles, which relies only on prior knowledge and partial observations of the environment to perform ad hoc teamwork. We make three distinct assumptions that set it apart from previous works, namely: i) the state of the environment is always partially observable, ii) the actions of the teammates are always unavailable to the ad hoc agent, and iii) the ad hoc agent has no access to a reward signal that could be used to learn the task from scratch. Our results in 70 POMDPs from 11 domains show that our approach is not only effective in assisting unknown teammates in solving unknown tasks, but also robust in scaling to more challenging problems.
Abstract: We study the convergence of $Q$-learning with linear function approximation. Our key contribution is the introduction of a novel multi-Bellman operator that extends the traditional Bellman operator. By exploring the properties of this operator, we identify conditions under which the projected multi-Bellman operator becomes contractive, providing improved fixed-point guarantees compared to the Bellman operator. To leverage these insights, we propose the multi $Q$-learning algorithm with linear function approximation. We demonstrate that this algorithm converges to the fixed point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy. Finally, we validate our approach by applying it to well-known environments, showcasing the effectiveness and applicability of our findings.
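As a sketch of the construction, assuming the multi-Bellman operator is the $m$-fold composition of the standard Bellman operator, with $\Phi$ the feature matrix and $\Pi$ the projection onto its column span:

$$(\mathcal{T} q)(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} q(s',a'), \qquad \mathcal{T}^m = \underbrace{\mathcal{T} \circ \cdots \circ \mathcal{T}}_{m \text{ times}},$$

$$\Phi \theta^\star = \Pi\, \mathcal{T}^m (\Phi \theta^\star),$$

so that, whenever $\Pi \mathcal{T}^m$ is contractive, the fixed point $\theta^\star$ is unique and can be approached iteratively.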
Abstract: We study the problem of teaching via demonstrations in sequential decision-making tasks. In particular, we focus on the setting in which the teacher has no access to the learner's model and policy, and the feedback from the learner is limited to trajectories that start from states selected by the teacher. The need to select the starting states and infer the learner's policy creates an opportunity for the teacher to use methods from inverse reinforcement learning and active learning. In this work, we formalize the teaching process with limited feedback and propose an algorithm that solves this teaching problem. The algorithm uses a modified version of the active value-at-risk method to select the starting states, a modified maximum causal entropy algorithm to infer the policy, and the difficulty score ratio method to choose the teaching demonstrations. We test the algorithm in a synthetic car-driving environment and conclude that it is an effective solution when the learner's feedback is limited.
Abstract: This work proposes a novel model-free Reinforcement Learning (RL) agent that learns to complete an unknown task while having access to only part of the input observation. We take inspiration from the concepts of visual attention and active perception, characteristic of humans, and apply them to our agent, creating a hard attention mechanism. In this mechanism, the model first decides which region of the input image to look at, and only then gains access to the pixels of that region. Current RL agents do not follow this principle, and, to our knowledge, such mechanisms have not been applied for the purpose pursued in this work. In our architecture, we adapt an existing model, the recurrent attention model (RAM), and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics can achieve performance comparable to state-of-the-art model-free RL agents that access the full input observation. This analysis is conducted in two Atari games, Pong and Space Invaders, which have a discrete action space, and in CarRacing, which has a continuous action space. Besides assessing its performance, we also analyze the movement of our model's attention and compare it with an example of human behavior. Even with this visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.
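As a minimal sketch of the hard attention step, assuming the attention location is produced by a separate policy head (the glimpse size and frame dimensions below are illustrative):

```python
import numpy as np

def glimpse(frame: np.ndarray, center: tuple, size: int = 24) -> np.ndarray:
    """Return a size x size crop of the frame around `center`; the agent
    only ever observes this patch, never the full frame (hard attention)."""
    h, w = frame.shape[:2]
    half = size // 2
    y = int(np.clip(center[0], half, h - half))
    x = int(np.clip(center[1], half, w - half))
    return frame[y - half:y + half, x - half:x + half]

# Illustrative step: the model first picks where to look, then acts on the patch.
frame = np.random.rand(210, 160)      # Atari-sized grayscale frame
location = (105, 80)                  # would come from the attention policy
patch = glimpse(frame, location)      # only this patch reaches the agent
assert patch.shape == (24, 24)
```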
Abstract: We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks with any communication level at execution time by taking advantage of information sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized) to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model, used to estimate missing agents' observations, with a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and on extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms baselines, allowing agents to act with faulty communication while successfully exploiting shared information.
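The dropout idea can be sketched as follows; the masking probability, observation shapes, and zero-filling of missing messages are illustrative assumptions rather than the exact MARO training scheme:

```python
import numpy as np

def mask_communication(joint_obs: np.ndarray, agent_idx: int, p_drop: float) -> np.ndarray:
    """Simulate an arbitrary communication level during centralized training:
    each teammate's observation is independently dropped with probability p_drop."""
    masked = joint_obs.copy()
    for j in range(joint_obs.shape[0]):
        if j != agent_idx and np.random.rand() < p_drop:
            masked[j] = 0.0  # missing message; a predictive model could fill it in
    return masked

# Example: 3 agents with 8-dimensional observations, agent 0 acting under unreliable comms.
joint_obs = np.random.rand(3, 8)
obs_seen_by_agent0 = mask_communication(joint_obs, agent_idx=0, p_drop=0.5)
```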
Abstract: In this paper, we investigate the notion of legibility in sequential decision tasks under uncertainty. Previous works that extend legibility to scenarios beyond robot motion either focus on deterministic settings or are computationally too expensive. Our proposed approach, dubbed PoL-MDP, is able to handle uncertainty while remaining computationally tractable. We establish the advantages of our approach over state-of-the-art approaches in several simulated scenarios of varying complexity. We also showcase the use of our legible policies as demonstrations for an inverse reinforcement learning agent, establishing their superiority over the commonly used demonstrations based on the optimal policy. Finally, we assess the legibility of our computed policies through a user study in which people are asked to infer the goal of a mobile robot following a legible policy by observing its actions.
Abstract: Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem, due to the inherent heterogeneity of data obtained from different channels. To address it, we present Geometric Multimodal Contrastive (GMC) representation learning, a novel method comprising two main components: i) a two-level architecture consisting of modality-specific base encoders, which process an arbitrary number of modalities into intermediate representations of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems, including prediction and reinforcement learning tasks.
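A minimal sketch of a contrastive alignment loss of this kind, assuming an NT-Xent-style objective between each modality-specific embedding and the joint embedding of the same sample (the exact GMC loss may differ):

```python
import torch
import torch.nn.functional as F

def alignment_loss(modality_z: torch.Tensor, joint_z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull each modality-specific embedding towards the joint embedding of the
    same sample and away from the joint embeddings of other samples in the batch."""
    m = F.normalize(modality_z, dim=-1)   # (batch, dim)
    j = F.normalize(joint_z, dim=-1)      # (batch, dim)
    logits = m @ j.t() / tau              # pairwise similarities within the batch
    targets = torch.arange(m.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example with random embeddings for a batch of 16 samples.
loss = alignment_loss(torch.randn(16, 64), torch.randn(16, 64))
```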
Abstract: Collective risk dilemmas (CRDs) are a class of n-player games that represent societal challenges in which groups need to coordinate to avoid the risk of a disastrous outcome. Multi-agent systems facing such dilemmas have difficulty achieving cooperation and often converge to sub-optimal, risk-dominant solutions where everyone defects. In this paper, we investigate the consequences of risk diversity in groups of agents learning to play CRDs. We find that risk diversity poses new challenges to cooperation that are not observed in homogeneous groups. We show that increasing risk diversity significantly reduces overall cooperation and hinders collective target achievement. It leads to asymmetrical changes in agents' policies -- i.e., the increase in contributions from individuals at high risk is unable to compensate for the decrease in contributions from individuals at low risk -- which reduces the total contributions in a population. When comparing RL behaviors to rational individualistic and social behaviors, we find that RL populations converge to fairer contributions among agents. Our results highlight the need to align risk perceptions among agents or to develop new learning techniques that explicitly account for risk diversity.
Abstract: In this paper, we present a novel Bayesian online prediction algorithm for the problem setting of ad hoc teamwork under partial observability (ATPO), which enables on-the-fly collaboration with unknown teammates performing an unknown task without requiring a pre-coordination protocol. Unlike previous works that assume a fully observable state of the environment, ATPO accommodates partial observability, using the agent's observations to identify which task the teammates are performing. Our approach assumes neither that the teammates' actions are visible nor that an environment reward signal is available. We evaluate ATPO in three domains -- two modified, partially observable versions of the Pursuit domain and the Overcooked domain. Our results show that ATPO is effective and robust in identifying the teammate's task from a large library of possible tasks, efficient at solving it in near-optimal time, and scalable in adapting to increasingly larger problem sizes.
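The task-identification step can be sketched as a generic Bayesian filter over a library of candidate tasks; the likelihood values below are illustrative placeholders that would come from each candidate task's model:

```python
import numpy as np

def update_task_belief(belief: np.ndarray, obs_likelihoods: np.ndarray) -> np.ndarray:
    """One step of Bayesian filtering over candidate tasks:
    belief[k] is P(task k), obs_likelihoods[k] is P(observation | task k)."""
    posterior = belief * obs_likelihoods
    total = posterior.sum()
    if total == 0.0:              # observation impossible under every task; keep prior
        return belief
    return posterior / total

# Example: uniform prior over 4 candidate tasks, one new observation.
belief = np.full(4, 0.25)
likelihoods = np.array([0.05, 0.40, 0.10, 0.01])  # placeholder likelihoods
belief = update_task_belief(belief, likelihoods)  # task 1 becomes most probable
```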