Abstract: Rewards serve as a measure of user satisfaction and act as a limiting factor in interactive recommender systems. In this work, we focus on the problem of learning to reward (LTR), which is fundamental to reinforcement learning. Previous approaches either introduce additional procedures for learning to reward, thereby increasing the complexity of optimization, or assume that user-agent interactions provide perfect demonstrations, which is rarely feasible in practice. Ideally, we would like a unified approach that optimizes both the reward and the policy using compositional demonstrations. However, this requirement presents a challenge, since rewards inherently quantify user feedback on-policy, while recommender agents approximate future cumulative valuation off-policy. To tackle this challenge, we propose a novel batch inverse reinforcement learning paradigm that achieves the desired properties. Our method uses discounted stationary distribution correction to combine LTR with recommender agent evaluation. To fulfill the compositional requirement, we incorporate the concept of pessimism through conservatism: specifically, we modify the vanilla correction using a Bellman transformation and enforce KL regularization to constrain consecutive policy updates. We conduct empirical studies on two real-world datasets representing two levels of compositional coverage; the results show that the proposed method yields relative improvements in both effectiveness (2.3\%) and efficiency (11.53\%).
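As a rough illustration of the two ingredients named above, the sketch below is a simplified approximation rather than the paper's implementation; the names `corrected_value`, `w`, `pi_old`, `pi_new`, and `beta` are illustrative assumptions. It combines a discounted stationary distribution-corrected value estimate over logged data with a KL penalty that keeps consecutive policy updates conservative.

```python
# Minimal sketch (not the authors' implementation) of off-policy value
# estimation with discounted stationary distribution correction plus a KL
# penalty between consecutive policy updates. All names and constants here
# are illustrative assumptions.
import numpy as np

def corrected_value(rewards, w, gamma=0.99):
    """Estimate (1 - gamma) * E_{d^pi}[r] from logged transitions, where w
    approximates the discounted stationary distribution ratio d^pi / d^D."""
    return (1.0 - gamma) * np.mean(w * rewards)

def kl_regularizer(pi_new, pi_old, eps=1e-8):
    """KL(pi_new || pi_old), averaged over logged states, constraining
    consecutive policy updates."""
    log_ratio = np.log((pi_new + eps) / (pi_old + eps))
    return np.mean(np.sum(pi_new * log_ratio, axis=-1))

# Toy usage with random logged data (batch of 128 transitions, 10 items).
rng = np.random.default_rng(0)
rewards = rng.random(128)
w = 2.0 * rng.random(128)                      # learned correction ratios
pi_old = rng.dirichlet(np.ones(10), size=128)  # previous policy
pi_new = rng.dirichlet(np.ones(10), size=128)  # updated policy
beta = 0.1                                     # KL penalty weight
objective = corrected_value(rewards, w) - beta * kl_regularizer(pi_new, pi_old)
```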
Abstract: Survivor bias in observational data leads the optimization of recommender systems towards local optima. Currently, most solutions re-mine existing human-system collaboration patterns to maximize longer-term satisfaction via reinforcement learning. However, from a causal perspective, mitigating survivor effects requires answering a counterfactual question, which is generally unidentifiable and inestimable. In this work, we propose a neural causal model to achieve counterfactual inference. Specifically, we first build a learnable structural causal model based on its available graphical representation, which qualitatively characterizes the preference transitions. Mitigation of survivor bias is achieved through counterfactual consistency. To identify the consistency, we use the Gumbel-max function as a structural constraint. To estimate the consistency, we apply reinforcement learning-based optimization and use the Gumbel-Softmax as a trade-off to obtain a differentiable objective. Both theoretical and empirical studies demonstrate the effectiveness of our solution.
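To make the Gumbel-max/Gumbel-Softmax idea concrete, here is a minimal sketch (illustrative only; the logits, temperature, and noise-sharing scheme are assumptions, not the paper's model) of how fixing the exogenous Gumbel noise yields a counterfactual outcome consistent with the observed one, and how the Gumbel-Softmax relaxation makes the same mechanism differentiable.

```python
# Illustrative sketch of the Gumbel-max trick as a structural mechanism for
# counterfactual consistency, with Gumbel-Softmax as its differentiable
# surrogate. The logits, temperature tau, and noise reuse are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(logits, g):
    """Hard categorical sample: argmax of logits plus fixed Gumbel noise g.
    Reusing the same g under intervened logits gives the counterfactual
    outcome consistent with the observed one."""
    return int(np.argmax(logits + g))

def gumbel_softmax(logits, g, tau=0.5):
    """Differentiable relaxation of the same mechanism (temperature tau)."""
    y = (logits + g) / tau
    y = y - y.max()                 # numerical stability
    e = np.exp(y)
    return e / e.sum()

logits_factual = np.array([1.0, 0.2, -0.5])   # observed preference scores
logits_counter = np.array([1.0, 1.5, -0.5])   # intervened scores
g = rng.gumbel(size=3)                        # shared exogenous noise

observed = gumbel_max_sample(logits_factual, g)
counterfactual = gumbel_max_sample(logits_counter, g)  # same noise, new logits
soft = gumbel_softmax(logits_counter, g)               # gradient-friendly version
```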
Abstract: Influence Maximization (IM) is the task of selecting a fixed number of seed nodes in a given network to maximize dissemination benefits. Although much recent research has been devoted to efficient algorithms, the graph structure and the objective function themselves are usually left underexplored. Motivated by this, we make the first attempt at hypergraph-based IM with a novel causal objective. We consider the case in which each hypergraph node carries specific attributes with an Individual Treatment Effect (ITE), namely the change in potential outcomes before and after infection from a causal inference perspective. In many scenarios, the sum of ITEs of the infected nodes is a more reasonable objective for influence spread, yet it is difficult to achieve with current IM algorithms. In this paper, we introduce a new algorithm called \textbf{CauIM}. We first recover the ITE of each node from observational data and then run a weighted greedy algorithm to maximize the sum of ITEs of the infected nodes. Theoretically, we present a generalized lower bound on influence spread beyond the well-known $(1-\frac{1}{e})$ optimality guarantee and provide a robustness analysis. Empirically, in real-world experiments, we demonstrate the effectiveness and robustness of \textbf{CauIM}, which significantly outperforms previous IM and randomized methods.
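As a toy sketch of the weighted greedy step (illustrative assumptions: an ordinary graph rather than a hypergraph, an independent-cascade-style spread simulator, and pre-estimated per-node ITEs; all function names are hypothetical), seeds are added one at a time to maximize the expected sum of ITEs of infected nodes.

```python
# Toy sketch of ITE-weighted greedy seed selection; not the CauIM algorithm
# itself. Uses a plain graph and a simple independent-cascade simulator.
import random

random.seed(0)

def simulate_spread(graph, seeds, p=0.1, rounds=50):
    """Monte Carlo estimate of each node's infection probability from `seeds`."""
    counts = {v: 0.0 for v in graph}
    for _ in range(rounds):
        infected, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph[u]:
                if v not in infected and random.random() < p:
                    infected.add(v)
                    frontier.append(v)
        for v in infected:
            counts[v] += 1.0 / rounds
    return counts

def weighted_greedy(graph, ite, k):
    """Greedily add the seed with the largest marginal gain in expected sum of ITEs."""
    seeds = set()
    for _ in range(k):
        base = sum(p * ite[v] for v, p in simulate_spread(graph, seeds).items()) if seeds else 0.0
        best, best_gain = None, float("-inf")
        for cand in graph:
            if cand in seeds:
                continue
            probs = simulate_spread(graph, seeds | {cand})
            gain = sum(p * ite[v] for v, p in probs.items()) - base
            if gain > best_gain:
                best, best_gain = cand, gain
        seeds.add(best)
    return seeds

# Toy usage: 5-node graph with hypothetical ITE estimates.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
ite = {0: 0.4, 1: -0.1, 2: 0.9, 3: 0.3, 4: 0.7}
print(weighted_greedy(graph, ite, k=2))
```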