Abstract:In-context learning allows models like transformers to adapt to new tasks from a few examples without updating their weights, a desirable trait for reinforcement learning (RL). However, existing in-context RL methods, such as Algorithm Distillation (AD), demand large, carefully curated datasets and can be unstable and costly to train due to the transient nature of in-context learning abilities. In this work, we integrated n-gram induction heads into transformers for in-context RL. By incorporating these n-gram attention patterns, we significantly reduced the data required for generalization, with up to 27 times fewer transitions in the Key-to-Door environment, and eased the training process by making models less sensitive to hyperparameters. Our approach not only matches but often surpasses the performance of AD, demonstrating the potential of n-gram induction heads to enhance the efficiency of in-context RL.
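To make the idea of an n-gram attention pattern concrete, here is a minimal sketch of one common formulation: a position attends to every earlier token that directly followed a previous occurrence of the current n-gram. The abstract does not spell out the exact pattern used, so the matching rule, the function names, and the toy token sequence below are all assumptions for illustration. In an RL setting the tokens would presumably be discretized transitions rather than letters.

```python
def ngram_induction_mask(tokens, n):
    """Boolean mask where mask[t][j] is True if position t may attend to j.

    Assumed rule: j is attended when the n tokens ending at t equal the
    n tokens immediately preceding j, so j is "what came next" after an
    earlier copy of the current n-gram. The mask is causal (j <= t).
    """
    length = len(tokens)
    mask = [[False] * length for _ in range(length)]
    for t in range(n - 1, length):
        current = tokens[t - n + 1 : t + 1]  # n-gram ending at position t
        for j in range(n, t + 1):
            if tokens[j - n : j] == current:  # n-gram right before position j
                mask[t][j] = True
    return mask


def attended_positions(tokens, n):
    """Map each position to the list of positions it attends to."""
    mask = ngram_induction_mask(tokens, n)
    return {t: [j for j, hit in enumerate(row) if hit] for t, row in enumerate(mask)}


# Hypothetical toy sequence: the bigram ("a", "b") occurs three times.
tokens = ["a", "b", "c", "a", "b", "d", "a", "b"]
hits = attended_positions(tokens, n=2)
print(hits[7])  # positions of the tokens that followed earlier ("a", "b") bigrams
```

Because the pattern is computed directly from token equality rather than learned, a head of this kind needs no training data to form, which is one plausible reading of why it could reduce data requirements.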
Abstract:Following the success of the in-context learning paradigm in large-scale language and computer vision models, the recently emerging field of in-context reinforcement learning is experiencing rapid growth. However, its development has been held back by the lack of challenging benchmarks, as all experiments have been carried out in simple environments and on small-scale datasets. We present \textbf{XLand-100B}, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step to alleviate this problem. It contains complete learning histories for nearly $30{,}000$ different tasks, covering $100$B transitions and $2.5$B episodes. It took $50{,}000$ GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. With this substantial effort, we aim to democratize research in the rapidly growing field of in-context reinforcement learning and provide a solid foundation for further scaling. The code is open-source and available under Apache 2.0 licence at https://github.com/dunno-lab/xland-minigrid-datasets.
Abstract:Recent work has shown that supervised pre-training on learning histories of RL algorithms results in a model that captures the learning process and is able to improve in-context on novel tasks through interactions with an environment. Despite the progress in this area, there is still a gap in the existing literature, particularly in the in-context generalization to new action spaces. While existing methods show high performance on new tasks created by different reward distributions, their architectural design and training process are not suited for the introduction of new actions during evaluation. We aim to bridge this gap by developing an architecture and training methodology specifically for the task of generalizing to new action spaces. Inspired by Headless LLM, we remove the dependence on the number of actions by directly predicting the action embeddings. Furthermore, we use random embeddings to force the semantic inference of actions from context and to prepare for the new unseen embeddings during test time. Using multi-armed bandit environments with a variable number of arms, we show that our model achieves the performance of the data generation algorithm without requiring retraining for each new environment.
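The two ideas in the abstract above, predicting an action embedding instead of a fixed-size logit vector and resampling random embeddings for the actions, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how a predicted embedding is mapped back to an action, so nearest-by-dot-product decoding, the unit normalization, and the function names are hypothetical choices.

```python
import math
import random


def random_action_embeddings(num_actions, dim, rng):
    """Draw a fresh random unit vector for every action (assumed scheme)."""
    embeddings = []
    for _ in range(num_actions):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        embeddings.append([x / norm for x in vec])
    return embeddings


def decode_action(predicted, embeddings):
    """Return the index of the action whose embedding best matches the prediction.

    The output size of the model is `dim`, not `num_actions`, so the same
    model can be used with any number of actions.
    """
    scores = [sum(p * e for p, e in zip(predicted, emb)) for emb in embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
# The same decoding step works for bandits with different numbers of arms.
for num_arms in (5, 20):
    arms = random_action_embeddings(num_arms, dim=16, rng=rng)
    # Stand-in for a model output: here we simply reuse the embedding of arm 3.
    print(num_arms, decode_action(arms[3], arms))
```

Since the embeddings carry no fixed meaning across episodes, a model trained this way would have to infer what each action does from the context, which is the behaviour the abstract describes.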
Abstract:In-Context Reinforcement Learning is an emerging field with great potential for advancing Artificial Intelligence. Its core capability lies in generalizing to unseen tasks through interaction with the environment. To master these capabilities, an agent must be trained on specifically curated data that includes a policy improvement that an algorithm seeks to extract and then apply in context in the environment. However, for numerous tasks, training RL agents may be infeasible, while obtaining human demonstrations can be relatively easy. Additionally, it is rare to be given the optimal policy; typically, only suboptimal demonstrations are available. We propose $AD^{\epsilon}$, a method that leverages demonstrations without policy improvement and enables multi-task in-context learning in the presence of a suboptimal demonstrator. This is achieved by artificially creating a history of incremental improvement, wherein noise is systematically introduced into the demonstrator's policy. Consequently, each successive transition illustrates a marginally better trajectory than the previous one. Our approach was tested on the Dark Room and Dark Key-to-Door environments, resulting in over a $\textbf{2}$x improvement compared to the best available policy in the data.
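The construction of a synthetic improvement history described above can be sketched roughly as follows. The abstract only says that noise is systematically introduced into the demonstrator's policy, so the linear decay of the noise level, the epsilon-greedy style of mixing, the toy corridor environment, and all names here are assumptions made for illustration; the always-move-right demonstrator is just a stand-in.

```python
import random


def epsilon_schedule(num_episodes, eps_start=1.0, eps_end=0.0):
    """Linearly decay the noise level across episodes (assumed schedule)."""
    if num_episodes == 1:
        return [eps_end]
    return [eps_start + (eps_end - eps_start) * i / (num_episodes - 1)
            for i in range(num_episodes)]


def rollout(demo_policy, epsilon, rng, length=5, horizon=8):
    """One episode in a hypothetical 1-D corridor: start at cell 0, goal at the last cell."""
    pos, transitions = 0, []
    for _ in range(horizon):
        # With probability epsilon, replace the demonstrator's action by a random one.
        action = rng.randrange(2) if rng.random() < epsilon else demo_policy(pos)
        next_pos = max(0, min(length - 1, pos + (1 if action == 1 else -1)))
        reward = 1.0 if next_pos == length - 1 else 0.0
        transitions.append((pos, action, reward))
        pos = next_pos
        if reward > 0:
            break
    return transitions


def synthetic_history(demo_policy, num_episodes, seed=0):
    """Concatenate episodes from the noisiest to the cleanest policy."""
    rng = random.Random(seed)
    return [rollout(demo_policy, eps, rng) for eps in epsilon_schedule(num_episodes)]


history = synthetic_history(lambda pos: 1, num_episodes=20)
returns = [sum(r for _, _, r in episode) for episode in history]
print(returns)  # early episodes are mostly noise, the last one follows the demonstrator
```

Ordering the episodes from high to low noise yields a sequence that looks like a learning curve even though no RL algorithm was run, which is the kind of data an AD-style model can then be trained on.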
Abstract:We present XLand-MiniGrid, a suite of tools and grid-world environments for meta-reinforcement learning research inspired by the diversity and depth of XLand and the simplicity and minimalism of MiniGrid. XLand-MiniGrid is written in JAX, designed to be highly scalable, and can potentially run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources. To demonstrate the generality of our library, we have implemented some well-known single-task environments as well as new meta-learning environments capable of generating $10^8$ distinct tasks. We have empirically shown that the proposed environments can scale up to $2^{13}$ parallel instances on the GPU, reaching tens of millions of steps per second.
Abstract:The majority of Multi-Agent Reinforcement Learning (MARL) literature equates the cooperation of self-interested agents in mixed environments to the problem of social welfare maximization, allowing agents to arbitrarily share rewards and private information. This results in agents that forgo their individual goals in favour of social good, which can potentially be exploited by selfish defectors. We argue that cooperation also requires agents' identities and boundaries to be respected by making sure that the emergent behaviour is an equilibrium, i.e., a convention that no agent can deviate from and receive higher individual payoffs. Inspired by advances in mechanism design, we propose to solve the problem of cooperation, defined as finding a socially beneficial equilibrium, by using mediators. A mediator is a benevolent entity that may act on behalf of agents, but only for the agents that agree to it. We show how a mediator can be trained alongside agents with policy gradient to maximize social welfare subject to constraints that encourage agents to cooperate through the mediator. Our experiments in matrix and iterative games highlight the potential power of applying mediators in MARL.