Marco Bagatella

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

May 26, 2025

Leander Diaz-Bone, Marco Bagatella, Jonas Hübotter, Andreas Krause

Abstract: Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL's core premise - requiring efficient exploration coupled with long-horizon credit assignment - and overcoming these challenges is key for building self-improving agents with superhuman ability. We argue that solving complex and high-dimensional tasks requires solving simpler tasks that are relevant to the target task. In contrast, most prior work designs strategies for selecting exploratory tasks with the objective of solving any task, making exploration of challenging high-dimensional, long-horizon tasks intractable. We find that the sense of direction, necessary for effective exploration, can be extracted from existing RL algorithms, without needing any prior information. Based on this finding, we propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target, but independent of the volume of the space of all tasks. Empirically, we perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.

Problem Space Transformations for Generalisation in Behavioural Cloning

Nov 06, 2024

Kiran Doshi, Marco Bagatella, Stelian Coros

Abstract: The combination of behavioural cloning and neural networks has driven significant progress in robotic manipulation. As these algorithms may require a large number of demonstrations for each task of interest, they remain fundamentally inefficient in complex scenarios. This issue is aggravated when the system is treated as a black-box, ignoring its physical properties. This work characterises widespread properties of robotic manipulation, such as pose equivariance and locality. We empirically demonstrate that transformations arising from each of these properties allow neural policies trained with behavioural cloning to better generalise to out-of-distribution problem instances.

Zero-Shot Offline Imitation Learning via Optimal Transport

Oct 11, 2024

Thomas Rupf, Marco Bagatella, Nico Gürtler, Jonas Frey, Georg Martius

Abstract: Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.

Active Fine-Tuning of Generalist Policies

Oct 07, 2024

Marco Bagatella, Jonas Hübotter, Georg Martius, Andreas Krause

Abstract: Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide which tasks should be demonstrated and how often. We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.

Directed Exploration in Reinforcement Learning from Linear Temporal Logic

Aug 18, 2024

Marco Bagatella, Andreas Krause, Georg Martius

Abstract: Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning, as it allows describing objectives beyond the expressivity of conventional discounted return formulations. Nonetheless, recent works have shown that LTL formulas can be translated into a variable rewarding and discounting scheme, whose optimization produces a policy maximizing a lower bound on the probability of formula satisfaction. However, the synthesized reward signal remains fundamentally sparse, making exploration challenging. We aim to overcome this limitation, which can prevent current algorithms from scaling beyond low-dimensional, short-horizon problems. We show how better exploration can be achieved by further leveraging the LTL specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process, thus enabling a form of high-level value estimation. By taking a Bayesian perspective over LDBA dynamics and proposing a suitable prior distribution, we show that the values estimated through this procedure can be treated as a shaping potential and mapped to informative intrinsic rewards. Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.
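
As a rough illustration of how estimated values can serve as a shaping potential, the sketch below uses the standard potential-based shaping term, gamma * phi(q') - phi(q), evaluated over automaton states. The potential table `phi`, the state names, and the discount value are hypothetical placeholders, and the sketch does not reproduce the Bayesian value estimation described in the abstract.

```python
def shaping_reward(phi, q, q_next, gamma=0.99):
    """Potential-based shaping term: gamma * phi[q_next] - phi[q]."""
    return gamma * phi[q_next] - phi[q]


# Hypothetical potentials for three automaton states (assumed values).
phi = {"q0": 0.1, "q1": 0.5, "q_acc": 1.0}

# Moving towards a higher-potential state yields a positive bonus,
# moving back yields a negative one.
progress = shaping_reward(phi, "q0", "q1")
regress = shaping_reward(phi, "q1", "q0")
```

Under these assumed potentials, a transition that advances the automaton receives a positive intrinsic reward even when the task reward is still zero.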

Causal Action Influence Aware Counterfactual Data Augmentation

May 29, 2024

Núria Armengol Urpí, Marco Bagatella, Marin Vlastelica, Georg Martius

Abstract: Offline data are both valuable and practical resources for teaching robots complex behaviors. Ideally, learning agents should not be constrained by the scarcity of available demonstrations, but rather generalize beyond the training distribution. However, the complexity of real-world scenarios typically requires huge amounts of data to prevent neural network policies from picking up on spurious correlations and learning non-causal relationships. We propose CAIAC, a data augmentation method that can create feasible synthetic transitions from a fixed dataset without having access to online environment interactions. By utilizing principled methods for quantifying causal influence, we are able to perform counterfactual reasoning by swapping action-unaffected parts of the state-space between independent trajectories in the dataset. We empirically show that this leads to a substantial increase in robustness of offline learning algorithms against distributional shift.

* Accepted in 41st International Conference on Machine Learning (ICML 2024)
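
The swapping step mentioned in the abstract can be pictured with a minimal sketch. It assumes a state factored into named entities and takes the set of action-unaffected entities as given; the function name, the dictionary layout, and the example values are all hypothetical, and the causal influence measure that would produce the `unaffected` set is not shown.

```python
def counterfactual_swap(transition, donor, unaffected):
    """Return a copy of `transition` whose action-unaffected entities
    are taken from `donor` (in both state and next state)."""
    new = {
        "state": dict(transition["state"]),
        "action": transition["action"],
        "next_state": dict(transition["next_state"]),
    }
    for entity in unaffected:
        new["state"][entity] = donor["state"][entity]
        new["next_state"][entity] = donor["next_state"][entity]
    return new


# Hypothetical transitions in which the action only moves the cube.
a = {"state": {"cube": 0.0, "drawer": 0.2}, "action": "push_cube",
     "next_state": {"cube": 0.1, "drawer": 0.2}}
b = {"state": {"cube": 0.5, "drawer": 0.9}, "action": "push_cube",
     "next_state": {"cube": 0.6, "drawer": 0.9}}

# Assumed: "drawer" is unaffected by the action in both transitions.
augmented = counterfactual_swap(a, b, unaffected={"drawer"})
```

The augmented transition keeps the cube dynamics of `a` while showing the drawer configuration of `b`, a combination that never appeared in the toy dataset.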

Goal-conditioned Offline Planning from Curious Exploration

Nov 28, 2023

Marco Bagatella, Georg Martius

Abstract: Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

Efficient Learning of High Level Plans from Play

Mar 16, 2023

Núria Armengol Urpí, Marco Bagatella, Otmar Hilliges, Georg Martius, Stelian Coros

Abstract: Real-world robotic manipulation tasks remain an elusive challenge, since they involve both fine-grained environment interaction, as well as the ability to plan for long-horizon goals. Although deep reinforcement learning (RL) methods have shown encouraging results when planning end-to-end in high-dimensional environments, they remain fundamentally limited by poor sample efficiency due to inefficient exploration, and by the complexity of credit assignment over long horizons. In this work, we present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL to achieve long-horizon complex manipulation tasks. We leverage task-agnostic play data to learn a discrete behavioral prior over object-centric primitives, modeling their feasibility given the current context. We then design a high-level goal-conditioned policy which (1) uses primitives as building blocks to scaffold complex long-horizon tasks and (2) leverages the behavioral prior to accelerate learning. We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks and learns policies that can be easily transferred to physical hardware.

* Accepted to the International Conference on Robotics and Automation 2023

TempoRL: Temporal Priors for Exploration in Off-Policy Reinforcement Learning

May 26, 2022

Marco Bagatella, Sammy Christen, Otmar Hilliges

Abstract: Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-independent temporal priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
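
Sampling from a probabilistic mixture of policy and action prior can be sketched in a few lines. The version below is a simplified illustration: the function name, the callables, and the `prior_weight` argument are hypothetical, and the weight is passed in as a plain number, whereas the abstract describes it as being set dynamically.

```python
import random


def sample_action(policy_sample, prior_sample, prior_weight, rng=random):
    """Draw from the action prior with probability `prior_weight`,
    otherwise draw from the policy."""
    if rng.random() < prior_weight:
        return prior_sample()
    return policy_sample()


# Hypothetical stand-ins for a learned policy and a temporal prior.
def policy():
    return "policy_action"


def prior():
    return "prior_action"


always_policy = sample_action(policy, prior, prior_weight=0.0)
always_prior = sample_action(policy, prior, prior_weight=1.0)
```

Because `random.random()` returns values in [0, 1), a weight of 0 always selects the policy and a weight of 1 always selects the prior; intermediate weights interpolate between the two.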

Planning from Pixels in Environments with Combinatorially Hard Search Spaces

Oct 12, 2021

Marco Bagatella, Mirek Olšák, Michal Rolínek, Georg Martius

Abstract: The ability to form complex plans based on raw visual input is a litmus test for current capabilities of artificial intelligence, as it requires a seamless combination of visual processing and abstract algorithmic execution, two traditionally separate areas of computer science. A recent surge of interest in this field brought advances that yield good performance in tasks ranging from arcade games to continuous control; these methods, however, do not come without significant issues, such as limited generalization capabilities and difficulties when dealing with combinatorially hard planning instances. Our contribution is two-fold: (i) we present a method that learns to represent its environment as a latent graph and leverages state reidentification to reduce the complexity of finding a good policy from exponential to linear, and (ii) we introduce a set of lightweight environments with an underlying discrete combinatorial structure in which planning is challenging even for humans. Moreover, we show that our method achieves strong empirical generalization to variations in the environment, even across highly disadvantaged regimes, such as "one-shot" planning, or in an offline RL paradigm which only provides low-quality trajectories.
