Abstract:Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This bias becomes particularly problematic over time as a few items are repeatedly over-represented in recommendation lists, creating a feedback loop that further amplifies it. Although extensive research has addressed this issue in model-based or neighborhood-based recommendation algorithms, less attention has been paid to online recommendation models, such as those based on top-K contextual bandits, where recommendation models are dynamically updated with ongoing user feedback. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We analyze these algorithms in terms of their ability to handle exposure bias and provide a fair representation of items in the recommendation results. Our analysis reveals that these algorithms fail to mitigate exposure bias in the long run over the course of ongoing user interactions. We propose an Exposure-Aware reward model that updates the model parameters based on two factors: 1) implicit user feedback and 2) the position of the item in the recommendation list. The proposed model mitigates exposure bias by controlling the utility assigned to items according to their exposure in the recommendation list. Our experiments on two real-world datasets show that the proposed reward model improves the exposure fairness of linear cascading bandits over time while maintaining recommendation accuracy, and that it outperforms existing baselines. Finally, we prove a high-probability upper bound on the regret of our proposed model, providing theoretical guarantees for its performance.
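The abstract does not spell out the exact form of the Exposure-Aware reward, so the following is only a minimal sketch of the idea; the position-based exposure weight (1/log2(k+2)) and the penalty coefficient lam are chosen purely for illustration and are not the paper's definition.

import numpy as np

def exposure_aware_reward(clicked: bool, position: int, lam: float = 0.1) -> float:
    # Clicked items receive full credit; items that were exposed but not
    # clicked receive a small negative utility that grows with how
    # prominent (high up) their position in the list was.
    exposure = 1.0 / np.log2(position + 2)   # position 0 = top of the list
    return 1.0 if clicked else -lam * exposure

Under a scheme of this kind, an item ignored at the top of the list loses more utility than one ignored at position 10, which matches the mechanism the abstract describes for preventing repeatedly exposed but unclicked items from dominating future recommendations.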
Abstract:Two typical forms of bias in user interaction data with recommender systems (RSs) are popularity bias and positivity bias, which manifest themselves as the over-representation of interactions with popular items or items that users prefer, respectively. Debiasing methods aim to mitigate the effect of selection bias on the evaluation and optimization of RSs. However, existing debiasing methods only consider single-factor forms of bias, e.g., only the item (popularity) or only the rating value (positivity). This is in stark contrast with the real world, where user selections are generally affected by multiple factors at once. In this work, we consider multifactorial selection bias in RSs. Our focus is on selection bias affected by both item and rating value factors, which is a generalization and combination of popularity and positivity bias. While the concept of multifactorial bias is intuitive, it brings a severe practical challenge, as it requires substantially more data for accurate bias estimation. As a solution, we propose smoothing and alternating gradient descent techniques to reduce variance and improve the robustness of the resulting bias-corrected optimization. Our experimental results reveal that, with our proposed techniques, multifactorial bias corrections are more effective and robust than single-factor counterparts on real-world and synthetic datasets.
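As a hedged illustration of the data-sparsity challenge and one possible remedy (the smoothing rule, the single-factor fallback, and the IPS-style loss below are assumptions made for the sketch, not the authors' exact formulation), a multifactorial propensity over (item, rating-value) pairs could be estimated with shrinkage towards the marginals:

import numpy as np

def multifactorial_propensity(obs_counts, alpha=5.0):
    # obs_counts[i, r]: number of observed interactions with item i and
    # rating value r. Sparse per-(item, rating) counts are shrunk towards
    # the product of the item and rating-value marginals (a single-factor
    # fallback) so rare combinations do not get extreme inverse weights.
    total = obs_counts.sum()
    item_marg = obs_counts.sum(axis=1, keepdims=True) / total
    rating_marg = obs_counts.sum(axis=0, keepdims=True) / total
    fallback = item_marg * rating_marg * total
    smoothed = (obs_counts + alpha * fallback) / (1.0 + alpha)
    return smoothed / smoothed.sum()

def ips_loss(pred, true, item_idx, rating_idx, propensity):
    # Inverse-propensity-scored squared error over the observed entries.
    p = np.clip(propensity[item_idx, rating_idx], 1e-3, None)
    return np.mean((pred - true) ** 2 / p)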
Abstract:Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.
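A minimal sketch of how such a policy basis could be combined at test time via generalized policy improvement over successor features (the array shapes and the assumption that the current FSA state supplies the reward weights w are illustrative, not the paper's exact procedure):

import numpy as np

def gpi_action(psi, w):
    # psi: (n_policies, n_actions, d) successor features of each basis
    # (sub)policy at the current environment state.
    # w: (d,) reward weights for the subproblem induced by the current FSA state.
    q = psi @ w                            # (n_policies, n_actions) Q-value estimates
    return int(np.argmax(q.max(axis=0)))   # best action over the whole policy basis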
Abstract:Learning in high-dimensional action spaces is a key challenge in applying reinforcement learning (RL) to real-world systems. In this paper, we study the possibility of controlling power networks using RL methods. Power networks are critical infrastructures that are complex to control. In particular, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Hierarchical reinforcement learning (HRL) represents one approach to address this challenge. More precisely, an HRL framework for power network topology control is proposed. The HRL framework consists of three levels of action abstraction. At the highest level, there is the overall long-term task of power network operation, namely, keeping the power grid state within security constraints at all times, which is decomposed into two temporally extended actions: 'do nothing' versus 'propose a topology change'. At the intermediate level, the action space consists of all controllable substations. Finally, at the lowest level, the action space consists of all configurations of the chosen substation. By employing this HRL framework, several hierarchical power network agents are trained for the IEEE 14-bus network. Whereas at the highest level a purely rule-based policy is still chosen for all agents in this study, at the intermediate level the policy is trained using different state-of-the-art RL algorithms. At the lowest level, either an RL algorithm or a greedy algorithm is used. The performance of the different 3-level agents is compared with standard baseline (RL or greedy) approaches. A key finding is that the 3-level agent that employs RL at both the intermediate and the lowest level outperforms all other agents on the most difficult task. Our code is publicly available.
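A minimal sketch of the three-level control flow described above, assuming a rule-based top level, a learned substation-selection policy, and a learned or greedy configuration policy (the loading threshold, field names, and method names are illustrative assumptions, not the authors' code):

def act(grid_state, mid_level_policy, low_level_policy, rho_max=0.95):
    # Level 1 (rule-based): do nothing while all line loadings are within limits.
    if max(grid_state["line_loadings"]) < rho_max:
        return {"type": "do_nothing"}
    # Level 2: choose which controllable substation to reconfigure.
    substation = mid_level_policy.select_substation(grid_state)
    # Level 3: choose a busbar configuration for the chosen substation
    # (either via an RL policy or a greedy search).
    config = low_level_policy.select_configuration(grid_state, substation)
    return {"type": "set_topology", "substation": substation, "config": config}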
Abstract:Pool-based active learning (AL) is a promising technology for increasing the data-efficiency of machine learning models. However, surveys show that the performance of recent AL methods is very sensitive to the choice of dataset and training setting, making them unsuitable for general application. To tackle this problem, the field of Learning Active Learning (LAL) proposes to learn the active learning strategy itself, allowing it to adapt to the given setting. In this work, we propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem with an Attentive Conditional Neural Process model. Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives, such as those that do not weight the error on all data points equally. We experimentally verify that our Neural Process model outperforms a variety of baselines in these settings. Finally, our experiments show that our model exhibits a tendency towards improved stability across changing datasets. However, performance is sensitive to the choice of classifier, and more work is necessary to reduce the performance gap with the myopic oracle and to improve scalability. We present our work as a proof-of-concept for LAL on non-standard objectives and hope our analysis and modelling considerations inspire future LAL work.
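For concreteness, a myopic oracle for a non-uniform objective could be sketched as follows (the sklearn-style fit/predict interface and the per-point error weights are assumptions made for illustration, not the authors' exact setup):

import copy
import numpy as np

def myopic_oracle_pick(classifier, X_lab, y_lab, pool_X, pool_y, X_eval, y_eval, weights):
    # One-step-lookahead oracle: pick the pool point whose (hypothetically
    # labeled) addition minimizes a *weighted* evaluation error, i.e., an
    # objective that does not weight the error on all data points equally.
    best_idx, best_err = None, np.inf
    for idx in range(len(pool_X)):
        clf = copy.deepcopy(classifier)
        clf.fit(np.vstack([X_lab, pool_X[idx:idx + 1]]),
                np.append(y_lab, pool_y[idx]))
        err = np.average(clf.predict(X_eval) != y_eval, weights=weights)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx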
Abstract:A natural solution concept for many multiagent settings is the Stackelberg equilibrium, under which a ``leader'' agent selects a strategy that maximizes its own payoff assuming the ``follower'' chooses its best response to this strategy. Recent work has presented asymmetric learning updates that can be shown to converge to the \textit{differential} Stackelberg equilibria of two-player differentiable games. These updates are ``coupled'' in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied to \textit{ad hoc} interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an ``uncoupled'' learning process under which each player's learning update only depends on their observations of the other's behavior. We prove that this process converges to a local Stackelberg equilibrium under conditions similar to those of previous coupled methods. We conclude with a discussion of the potential applications of our approach to human--AI cooperation and multi-agent reinforcement learning.
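The following is only a schematic sketch of an uncoupled, asymmetric update of the kind the abstract describes, in which each player adjusts its strategy using its own payoff and the observed strategy of the other, with the follower on a faster timescale (the step sizes and gradient interfaces are assumptions, not the paper's exact rule):

def uncoupled_step(x, y, grad_leader, grad_follower, eta_x=0.01, eta_y=0.1):
    # grad_leader(x, y): gradient of the leader's own payoff w.r.t. its
    # strategy x, evaluated at the follower's observed strategy y; no access
    # to the follower's payoff function is required (hence "uncoupled").
    # grad_follower(x, y): the analogous quantity for the follower.
    x_new = x + eta_x * grad_leader(x, y)      # slow timescale (leader)
    y_new = y + eta_y * grad_follower(x, y)    # fast timescale (follower)
    return x_new, y_new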
Abstract:Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches have been proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in few steps. Experimentally, we show that our method learns transferable components which accelerate learning and that it outperforms existing methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as other proposed changes.
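As a hedged sketch of the meta-learning formulation (the task.loss interface, the single inner step, and the plain-SGD inner update are simplifying assumptions), the options are trained so that a few gradient steps on the higher-level policy suffice for a new task:

import torch

def meta_option_loss(option_params, high_level_params, task, inner_lr=0.1):
    # Inner loop: adapt only the higher-level (over-options) policy to the task.
    pre_loss = task.loss(option_params, high_level_params)
    grads = torch.autograd.grad(pre_loss, high_level_params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(high_level_params, grads)]
    # Outer loss: evaluate the adapted high-level policy; gradients of this
    # loss w.r.t. option_params incentivize options that enable fast adaptation.
    return task.loss(option_params, adapted)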
Abstract:Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This is especially problematic when the bias is amplified over time: a few items (e.g., popular ones) are repeatedly over-represented in recommendation lists, and users' interactions with those items further strengthen the bias towards them, resulting in a feedback loop. This issue has been extensively studied in the literature on model-based or neighborhood-based recommendation algorithms, but less work has been done on online recommendation models, such as those based on top-K contextual bandits, where recommendation models are dynamically updated with ongoing user feedback. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We analyze these algorithms in terms of their ability to handle exposure bias and provide a fair representation of items in the recommendation results. Our analysis reveals that these algorithms tend to amplify exposure disparity among items over time. In particular, we observe that these algorithms do not properly adapt to the feedback provided by users and frequently recommend certain items even when those items are not selected by users. To mitigate this bias, we propose an Exposure-Aware (EA) reward model that updates the model parameters based on two factors: 1) user feedback (i.e., clicked or not), and 2) the position of the item in the recommendation list. This way, the proposed model controls the utility assigned to items based on their exposure in the recommendation list. Extensive experiments on two real-world datasets using three contextual bandit algorithms show that the proposed reward model reduces exposure bias amplification in the long run while maintaining recommendation accuracy.
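To make the parameter update concrete, the sketch below shows where a feedback- and position-dependent pseudo-reward could enter the ridge-regression statistics of a linear cascading bandit; the cascade convention that positions up to the first click are examined is standard, while the specific exposure penalty is an assumption for illustration, not the paper's exact rule.

import numpy as np

def update_statistics(M, B, shown_features, click_pos, lam=0.1):
    # M, B: ridge-regression statistics of a linear (cascading) bandit,
    # so that theta_hat = inv(M) @ B estimates the attraction weights.
    # Under the cascade assumption, only positions up to the first click
    # (or the whole list if nothing was clicked) are treated as examined.
    last = click_pos if click_pos is not None else len(shown_features) - 1
    for k in range(last + 1):
        x = shown_features[k]
        clicked = (k == click_pos)
        # Exposure-aware pseudo-reward: unclicked-but-exposed items are
        # penalized, more strongly near the top of the list (illustrative).
        r = 1.0 if clicked else -lam / np.log2(k + 2)
        M += np.outer(x, x)
        B += r * x
    return M, B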
Abstract:In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce. A particularly well-known and important example is potential shaping, a class of functions that can be added to any reward function without changing the optimal policy set under arbitrary transition dynamics. Potential shaping is conceptually similar to potentials, conservative vector fields and gauge transformations in math and physics, but this connection has not previously been formally explored. We develop a formalism for discrete calculus on graphs that abstract a Markov Decision Process, and show how potential shaping can be formally interpreted as a gradient within this framework. This allows us to strengthen results from Ng et al. (1999) describing conditions under which potential shaping is the only additive reward transformation to always preserve optimal policies. As an additional application of our formalism, we define a rule for picking a single unique reward function from each potential shaping equivalence class.
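For reference, the potential shaping in question is the classical construction of Ng et al. (1999): given any potential function $\Phi$ over states, the shaping term added to the reward is
\[
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
\]
which preserves the optimal policy set under arbitrary transition dynamics; the paper then interprets this difference as a discrete gradient of $\Phi$ on a graph abstracting the MDP.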
Abstract:Hex is a turn-based two-player connection game with a high branching factor, making the game arbitrarily complex with increasing board sizes. As such, top-performing algorithms for playing Hex rely on accurate evaluation of board positions using neural networks. However, the limited interpretability of neural networks is problematic when the user wants to understand the reasoning behind the predictions made. In this paper, we propose to use propositional logic expressions to describe winning and losing board game positions, facilitating precise visual interpretation. We employ a Tsetlin Machine (TM) to learn these expressions from previously played games, describing where pieces must or must not be located for a board position to be strong. Extensive experiments on $6\times6$ boards compare our TM-based solution with popular machine learning algorithms like XGBoost, InterpretML, decision trees, and neural networks, considering various board configurations with $2$ to $22$ moves played. On average, the TM testing accuracy is $92.1\%$, outperforming all the other evaluated algorithms. We further demonstrate the global interpretation of the logical expressions and map them down to particular board game configurations to investigate local interpretability. We believe the resulting interpretability establishes building blocks for accurate assistive AI and human-AI collaboration, including for more complex prediction tasks.
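To give a flavor of the learned descriptions (the specific cells below are invented for illustration and are not taken from the trained model), a single TM clause is a conjunction of board literals that can be checked directly against a position:

def example_clause(board):
    # Illustrative conjunctive clause: the position is judged strong for White
    # if two neighboring cells are occupied by White and a key blocking cell is
    # NOT occupied by Black. board[row][col] holds "W", "B", or "" (empty).
    return board[2][2] == "W" and board[2][3] == "W" and board[3][1] != "B"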