Matthew Riemer

Handling Delay in Real-Time Reinforcement Learning

Mar 30, 2025

Ivan Anokhin, Rishav Rishav, Matthew Riemer, Stephen Chung, Irina Rish, Samira Ebrahimi Kahou

Abstract:Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can partly be addressed with pipelining, leading to higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of $\tau$, an $N$-layer feed-forward network experiences an observation delay of $\tau N$. Reducing the number of layers can decrease this delay, but at the cost of the network's expressivity. In this work, we explore the trade-off between minimizing delay and the network's expressivity. We present a theoretically motivated solution that leverages temporal skip connections combined with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four MuJoCo tasks and all MinAtar games. Moreover, we demonstrate that parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computations paves the way for more efficient RL agents in real-time settings.

* Accepted at ICLR 2025. Code available at https://github.com/avecplezir/realtime-agent
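
The $\tau N$ delay stated in the abstract above can be checked with a toy simulation. This is a minimal sketch, not the authors' code: the function name `output_staleness`, the one-tick-per-layer timing model, and the `input_skip` wiring (every layer also reads the current observation) are all assumptions made for illustration.

```python
def output_staleness(n_layers, n_ticks, tau=1.0, input_skip=False):
    """Age of the newest observation that the final layer's output depends on.

    Assumed timing model: all layers run in parallel, each takes one tick of
    length tau, and layer i reads what layer i-1 produced on the previous tick.
    With input_skip=True (hypothetical wiring) every layer also reads the
    current observation directly.
    """
    if n_ticks < n_layers:
        raise ValueError("run at least n_layers ticks so the pipeline fills up")
    # newest[i] = tick index of the freshest observation behind layer i's output
    newest = [None] * n_layers
    for t in range(n_ticks):
        # Update back to front so each layer sees its predecessor's previous output.
        for i in range(n_layers - 1, 0, -1):
            newest[i] = t if input_skip else newest[i - 1]
        newest[0] = t  # the first layer always consumes the current observation
    # After n_ticks ticks the clock reads n_ticks; outputs of tick t appear at t + 1.
    return tau * (n_ticks - newest[-1])


if __name__ == "__main__":
    print(output_staleness(n_layers=4, n_ticks=10, tau=0.5))
    print(output_staleness(n_layers=4, n_ticks=10, tau=0.5, input_skip=True))
```

Under this model the plain pipeline returns `tau * n_layers`, while the assumed skip wiring lets information from the most recent observation reach the output after a single tick. The paper's actual architectures and its history-augmented observations are not modeled here.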

EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts

Feb 20, 2025

Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer

Abstract:Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN, a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.

Can Large Language Models Adapt to Other Agents In-Context?

Dec 27, 2024

Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, Murray Campbell

Abstract:As the research community aims to build better AI assistants that are more dynamic and personalized to the diversity of humans that they interact with, there is increased interest in evaluating the theory of mind capabilities of large language models (LLMs). Indeed, several recent studies suggest that LLM theory of mind capabilities are quite impressive, approximating human-level performance. Our paper aims to rebuke this narrative and argues instead that past studies were not directly measuring agent performance, potentially leading to findings that are illusory in nature as a result. We draw a strong distinction between what we call literal theory of mind, i.e., measuring the agent's ability to predict the behavior of others, and functional theory of mind, i.e., adapting to agents in-context based on a rational response to predictions of their behavior. We find that top-performing open-source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind, even when partner policies are exceedingly simple. Our work serves to highlight the double-sided nature of inductive bias in LLMs when adapting to new situations. While this bias can lead to strong performance over limited horizons, it often hinders convergence to optimal long-term behavior.

Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference

Dec 18, 2024

Matthew Riemer, Gopeshh Subbaraj, Glen Berseth, Irina Rish

Abstract:Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectively minimize regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where reaction time is crucial. We present an analysis of lower bounds on regret in realtime reinforcement learning (RL) environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that use of models with high action inference times is only constrained by the environment's effective stochasticity over the inference horizon, and not by action frequency. Our analysis shows that the number of inference processes needed scales linearly with increasing inference times while enabling use of models that are multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokémon and Tetris.
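
The linear scaling claimed in the abstract above can be illustrated with back-of-the-envelope arithmetic. This is a sketch under stated assumptions, not the algorithms proposed in the paper: the function `stagger_schedule`, the round-robin assignment, and the fixed, deterministic inference time are all hypothetical simplifications.

```python
import math


def stagger_schedule(inference_time, action_interval, n_actions):
    """Round-robin start times so that one action completes every action_interval.

    Assumes every inference call takes exactly inference_time (same units as
    action_interval). Returns the number of processes needed and a list of
    (process_id, start_time, finish_time) tuples, one per action.
    """
    # Each process is busy for inference_time, so this many processes are needed
    # to start a fresh inference every action_interval without any overlap.
    n_procs = math.ceil(inference_time / action_interval)
    schedule = []
    for k in range(n_actions):
        start = k * action_interval  # stagger the starts evenly
        schedule.append((k % n_procs, start, start + inference_time))
    return n_procs, schedule


if __name__ == "__main__":
    n_procs, schedule = stagger_schedule(inference_time=10, action_interval=3, n_actions=6)
    print("processes needed:", n_procs)
    for proc, start, finish in schedule:
        print(f"process {proc}: start={start}, action ready at {finish}")
```

In this simplified model each action is still computed from an observation that is `inference_time` old. The staggering only restores a steady action rate, which matches the abstract's point that the remaining constraint is how much the environment changes over the inference horizon.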

Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Nov 11, 2024

Megh Thakkar, Yash More, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, Sarath Chandar

Abstract:There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
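
The idea of interpolating a domain vector and an alignment vector, as described in the abstract above, can be sketched on toy weights. This is a hedged illustration only: it assumes the common task-arithmetic definition of a "vector" (fine-tuned weights minus base weights), and the function names, the single coefficient `alpha`, and the dict-of-lists weight format are hypothetical rather than taken from MergeAlign itself.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base model."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_domain_and_alignment(base, domain_model, aligned_model, alpha=0.5):
    """Add an interpolation of the two task vectors back onto the base weights.

    alpha = 1.0 recovers the domain expert and alpha = 0.0 the aligned model
    (assumed parameterization, chosen for illustration).
    """
    domain_vec = task_vector(domain_model, base)
    align_vec = task_vector(aligned_model, base)
    return {
        name: [
            b + alpha * d + (1.0 - alpha) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    # Toy "models": a single parameter tensor flattened into a list.
    base = {"w": [0.0, 0.0]}
    domain = {"w": [2.0, 0.0]}
    aligned = {"w": [0.0, 4.0]}
    print(merge_domain_and_alignment(base, domain, aligned, alpha=0.5))
```

Real checkpoints would hold tensors rather than Python lists, and the interpolation weights used in the paper may differ from this single-coefficient form.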

Game-Theoretical Perspectives on Active Equilibria: A Preferred Solution Concept over Nash Equilibria

Oct 28, 2022

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob N. Foerster, Gerald Tesauro, Jonathan P. How

Abstract:Multiagent learning settings are inherently more difficult than single-agent learning because each agent interacts with other simultaneously learning agents in a shared environment. An effective approach in multiagent reinforcement learning is to consider the learning process of agents and influence their future policies toward desirable behaviors from each agent's perspective. Importantly, if each agent maximizes its long-term rewards by accounting for the impact of its behavior on the set of convergence policies, the resulting multiagent system reaches an active equilibrium. While this new solution concept is general such that standard solution concepts, such as a Nash equilibrium, are special cases of active equilibria, it is unclear when an active equilibrium is a preferred equilibrium over other solution concepts. In this paper, we analyze active equilibria from a game-theoretic perspective by closely studying examples where Nash equilibria are known. By directly comparing active equilibria to Nash equilibria in these examples, we find that active equilibria find more effective solutions than Nash equilibria, concluding that an active equilibrium is the desired solution for multiagent learning settings.

Influencing Long-Term Behavior in Multiagent Reinforcement Learning

Mar 07, 2022

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob N. Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, Jonathan P. How

Abstract:The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better long-term performance than state-of-the-art baselines in various domains, including the full spectrum of general-sum, competitive, and cooperative settings.

* Under review as a workshop paper

Continual Learning In Environments With Polynomial Mixing Times

Dec 13, 2021

Matthew Riemer, Sharath Chandra Raparthy, Ignacio Cases, Gopeshh Subbaraj, Maximilian Puelma Touzel, Irina Rish

Abstract:The mixing time of the Markov chain induced by a policy limits performance in real-world continual learning scenarios. Yet, the effect of mixing times on learning in continual reinforcement learning (RL) remains underexplored. In this paper, we characterize problems that are of long-term interest to the development of continual RL, which we call scalable MDPs, through the lens of mixing times. In particular, we establish that scalable MDPs have mixing times that scale polynomially with the size of the problem. We go on to demonstrate that polynomial mixing times present significant difficulties for existing approaches and propose a family of model-based algorithms that speed up learning by directly optimizing for the average reward through a novel bootstrapping procedure. Finally, we perform empirical regret analysis of our proposed approaches, demonstrating clear improvements over baselines and also how scalable MDPs can be used for analysis of RL algorithms as mixing times scale.

* 2 Figures, 20 pages
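
For readers unfamiliar with mixing times, the notion used in the abstract above can be made concrete on a toy Markov chain. This is a minimal sketch and not an example from the paper: the lazy random walk on a cycle, the threshold `eps=0.25`, and all function names are assumptions chosen because the chain's stationary distribution is uniform and its mixing time visibly grows with the number of states.

```python
def lazy_cycle(n):
    """Transition matrix of a lazy random walk on a cycle with n states."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] += 0.5             # stay put half of the time
        P[i][(i + 1) % n] += 0.25  # otherwise step clockwise ...
        P[i][(i - 1) % n] += 0.25  # ... or counter-clockwise
    return P


def step(dist, P):
    """One step of the chain: the row vector dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def mixing_time(P, stationary, eps=0.25, max_steps=100_000):
    """Smallest t at which every start state is within eps of stationary.

    Distance is total variation: half the L1 distance between distributions.
    """
    n = len(P)
    dists = [[1.0 if j == s else 0.0 for j in range(n)] for s in range(n)]
    for t in range(1, max_steps + 1):
        dists = [step(d, P) for d in dists]
        worst = max(
            0.5 * sum(abs(a - b) for a, b in zip(d, stationary)) for d in dists
        )
        if worst <= eps:
            return t
    raise RuntimeError("chain did not mix within max_steps")


if __name__ == "__main__":
    for n in (4, 8, 16):
        uniform = [1.0 / n] * n
        print(n, mixing_time(lazy_cycle(n), uniform))
```

In an RL setting the relevant chain is the one induced by a policy on the MDP, which this toy walk only stands in for.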

Context-Specific Representation Abstraction for Deep Option Learning

Sep 20, 2021

Marwa Abdulhai, Dong-Ki Kim, Matthew Riemer, Miao Liu, Gerald Tesauro, Jonathan P. How

Abstract:Hierarchical reinforcement learning has focused on discovering temporally extended actions, such as options, that can provide benefits in problems requiring extensive exploration. One promising approach that learns these options end-to-end is the option-critic (OC) framework. We examine and show in this paper that OC does not decompose a problem into simpler sub-problems, but instead increases the size of the search over policy space with each option considering the entire state space during learning. This issue can result in practical limitations of this method, including sample-inefficient learning. To address this problem, we introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space. Specifically, our method learns a factored belief state representation that enables each option to learn a policy over only a subsection of the state space. We test our method against hierarchical, non-hierarchical, and modular recurrent neural network baselines, demonstrating significant sample efficiency improvements in challenging partially observable environments.

Towards Continual Reinforcement Learning: A Review and Perspectives

Dec 25, 2020

Khimya Khetarpal, Matthew Riemer, Irina Rish, Doina Precup

Abstract:In this article, we aim to provide a literature review of different formulations and approaches to continual reinforcement learning (RL), also known as lifelong or non-stationary RL. We begin by discussing our perspective on why RL is a natural fit for studying continual learning. We then provide a taxonomy of different continual RL formulations and mathematically characterize the non-stationary dynamics of each setting. We go on to discuss evaluation of continual RL agents, providing an overview of benchmarks used in the literature and important metrics for understanding agent performance. Finally, we highlight open problems and challenges in bridging the gap between the current state of continual RL and findings in neuroscience. While still in its early days, the study of continual RL has the promise to develop better incremental reinforcement learners that can function in increasingly realistic applications where non-stationarity plays a vital role. These include applications such as those in the fields of healthcare, education, logistics, and robotics.

* Preprint, 52 pages, 8 figures
