Abstract: This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism and construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance-weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS) that logarithmically smooths large importance weights. The bound for LS is provably tighter than those of its competitors and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.
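The abstract does not spell out the exact form of the LS smoothing, so the snippet below is only a minimal sketch, assuming a transform of the shape w ↦ log(1 + λw)/λ applied to the importance weights; the function names and the parameter λ are illustrative, not the paper's.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Plain importance-weighted (IPS) estimate of a target policy's reward."""
    w = target_probs / logging_probs               # importance weights
    return np.mean(w * rewards)

def ls_estimate(rewards, logging_probs, target_probs, lam=0.5):
    """Hypothetical logarithmically-smoothed estimate: large weights are
    dampened via w -> log(1 + lam * w) / lam, which recovers plain IPS as
    lam -> 0 and shrinks heavy-tailed weights for lam > 0."""
    w = target_probs / logging_probs
    return np.mean(np.log1p(lam * w) / lam * rewards)

# Toy logged data: logging-policy propensities, the target policy's
# probabilities on the logged actions, and binary rewards.
rng = np.random.default_rng(0)
logging_probs = rng.uniform(0.05, 1.0, size=1_000)
target_probs = rng.uniform(0.0, 1.0, size=1_000)
rewards = rng.binomial(1, 0.3, size=1_000)
print(ips_estimate(rewards, logging_probs, target_probs))
print(ls_estimate(rewards, logging_probs, target_probs))
```

Since log(1 + λw)/λ ≤ w, the smoothed estimate never exceeds plain IPS for non-negative rewards, which is in the spirit of the pessimism principle described above.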
Abstract: An increasingly important building block of large-scale machine learning systems is based on returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries can be completed quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
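The abstract does not detail the relaxation itself; purely as an illustration of the general idea of turning a deterministic decision function into a differentiable policy, here is a generic softmax stand-in (not the paper's relaxation).

```python
import numpy as np

def decision(scores):
    """Deterministic decision function: return the highest-scoring item."""
    return int(np.argmax(scores))

def relaxed_policy(scores, temperature=1.0):
    """Generic relaxation of the argmax into a stochastic policy over items.
    A smooth policy like this admits gradient-based optimisation of an
    arbitrary reward; the temperature controls how close the policy stays
    to the deterministic decision."""
    z = (scores - scores.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

scores = np.array([0.1, 2.3, -0.5, 1.7])
print(decision(scores))                         # 1
print(relaxed_policy(scores, temperature=0.5))  # puts most mass on item 1
```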
Abstract: This paper introduces a new principled approach for offline policy optimisation in contextual bandits. For two well-established risk estimators, we propose novel generalisation bounds that make it possible to confidently improve upon the logging policy offline. Unlike previous work, our approach does not require tuning hyperparameters on held-out sets, and enables deployment with no prior A/B testing. This is achieved by analysing the problem through the PAC-Bayesian lens; in particular, we let go of traditional policy parametrisation (e.g. softmax) and instead interpret policies as mixtures of deterministic strategies. Through extensive experiments, we provide evidence of the tightness of our bounds and of the effectiveness of our approach in practical scenarios.
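For concreteness, the "mixtures of deterministic strategies" viewpoint can be sketched as follows; the clipped IPS estimator is shown as one representative of the two risk estimators mentioned, and the paper's exact PAC-Bayesian bound is not reproduced here.

```latex
% A distribution Q over parameters theta of deterministic decision rules
% d_theta induces a stochastic policy, and its clipped IPS risk on n logged
% triples (x_i, a_i, c_i) gathered under the logging policy pi_0 is
\pi_Q(a \mid x) \;=\; \mathbb{E}_{\theta \sim Q}\!\left[\mathbf{1}\{d_\theta(x) = a\}\right],
\qquad
\hat{R}_n(\pi_Q) \;=\; \frac{1}{n}\sum_{i=1}^{n} c_i
  \min\!\left(\frac{\pi_Q(a_i \mid x_i)}{\pi_0(a_i \mid x_i)},\, M\right).
% A PAC-Bayesian argument then bounds R(pi_Q) - \hat{R}_n(pi_Q)
% simultaneously for all Q in terms of KL(Q || P)/n, where P is a prior
% over theta, which is what allows confident offline improvement.
```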
Abstract: In both academic and industry research, online evaluation methods are seen as the gold standard for interactive applications like recommendation systems. Naturally, the reason for this is that we can directly measure utility metrics that rely on interventions, namely the recommendations that are shown to users. Nevertheless, online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures. In industry, offline metrics are often used as a first-line evaluation to generate promising candidate models to evaluate online. In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods. Two classes of offline metrics exist: proxy-based methods and counterfactual methods. The first class is often poorly correlated with the online metrics we care about, and the latter only provides theoretical guarantees under assumptions that cannot be fulfilled in real-world environments. Here, we make the case that simulation-based comparisons provide ways forward beyond offline metrics, and we argue that they are a preferable means of evaluation.
Abstract: Personalised interactive systems such as recommender systems require selecting relevant items depending on the context. Production systems need to identify these items rapidly from very large catalogues, which can be done efficiently using maximum inner product search technology. Offline optimisation of maximum inner product search can be achieved by a relaxation of the discrete problem, resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world settings, such as large-catalogue recommender systems, severely limiting the usefulness of the method in practice. In this paper, we show how it is possible to produce an excellent approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
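Of the three ideas listed, the self-normalised importance sampling step is the easiest to illustrate in isolation: expectations under the softmax policy are estimated from a handful of candidate items drawn from a proposal (in the paper, candidates retrieved by fast maximum inner product search), so the softmax normaliser over the full catalogue is never computed. The sketch below covers that step only, with made-up names and a uniform proposal.

```python
import numpy as np

def snis_expectation(scores, proposal_probs, values):
    """Self-normalised importance sampling estimate of
    E_{a ~ softmax(all scores)}[value(a)] using only sampled candidates.
    The catalogue-wide softmax normaliser cancels in the weight ratio."""
    w = np.exp(scores - scores.max()) / proposal_probs
    return np.sum(w * values) / np.sum(w)

# Catalogue of 1M items, but only 100 sampled candidates are ever scored.
rng = np.random.default_rng(0)
catalogue_size, n_candidates = 1_000_000, 100
proposal_probs = np.full(n_candidates, 1.0 / catalogue_size)  # uniform proposal
scores = rng.normal(size=n_candidates)    # e.g. <user, item> inner products
values = rng.uniform(size=n_candidates)   # e.g. per-item rewards
print(snis_expectation(scores, proposal_probs, values))
```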
Abstract: We introduce the Probabilistic Rank and Reward model (PRR), a scalable probabilistic model for personalized slate recommendation. Our model allows state-of-the-art estimation of user interests in the following ubiquitous recommender system scenario: a user is shown a slate of K recommendations and chooses at most one of these K items. The goal of the recommender system is to find the K items of most interest to the user in order to maximize the probability that the user interacts with the slate. Our contribution is to show that we can learn the probability of a recommendation being successful more effectively by combining the reward (whether or not the slate was clicked) and the rank (which item on the slate was selected). Our method learns more efficiently than bandit methods that use only the reward, and than user preference methods that use only the rank. It also provides estimation performance similar to or better than independent inverse-propensity-score methods while being far more scalable. Our method is state of the art in terms of both speed and accuracy on massive datasets with up to one million items. Finally, our method allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable for extremely low-latency domains such as computational advertising.
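The abstract does not give the PRR likelihood; as a loose sketch of how the reward (click or no click) and the rank (which of the K items was clicked) can be combined in a single observation model, one might write a categorical likelihood over K + 1 outcomes. The function and scores below are hypothetical, not the model from the paper.

```python
import numpy as np

def slate_log_likelihood(item_scores, no_click_score, clicked_pos=None):
    """Hypothetical per-impression likelihood: the user either clicks one of
    the K shown items (rank signal) or clicks nothing (reward signal).
    clicked_pos is the index of the clicked item, or None for no click."""
    logits = np.concatenate(([no_click_score], item_scores))
    log_z = logits.max() + np.log(np.sum(np.exp(logits - logits.max())))
    log_probs = logits - log_z                 # log-softmax over K + 1 outcomes
    target = 0 if clicked_pos is None else clicked_pos + 1
    return log_probs[target]

item_scores = np.array([1.2, -0.3, 0.7])       # scores of the K = 3 shown items
print(slate_log_likelihood(item_scores, no_click_score=0.5, clicked_pos=2))
print(slate_log_likelihood(item_scores, no_click_score=0.5))  # no click
```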
Abstract: This paper extends the Distributionally Robust Optimization (DRO) approach for offline contextual bandits. Specifically, we leverage this framework to introduce a convex reformulation of the Counterfactual Risk Minimization principle. Besides relying on convex programs, our approach is compatible with stochastic optimization and can therefore be readily adapted to the large-data regime. Our approach relies on the construction of asymptotic confidence intervals for offline contextual bandits through the DRO framework. By leveraging known asymptotic results for robust estimators, we also show how to automatically calibrate such confidence intervals, which in turn removes the burden of hyper-parameter selection for policy optimization. We present preliminary empirical results supporting the effectiveness of our approach.
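As a hedged illustration of how DRO yields a confidence interval: with a KL ball of radius ρ around the empirical distribution of the logged data, the worst-case risk of a policy π has the well-known one-dimensional convex dual below; the paper's specific divergence, radius calibration, and reformulation are not reproduced here.

```latex
% Worst-case (robust) risk over a KL ball around the empirical distribution
% \hat{P}_n, and its convex dual, which serves as an upper confidence bound
% on the true risk once the radius rho is calibrated.
\sup_{Q \,:\, \mathrm{KL}(Q \,\|\, \hat{P}_n) \le \rho}
  \mathbb{E}_{Q}\!\left[\ell(\pi, X)\right]
\;=\;
\inf_{\lambda > 0}\;
  \lambda \log \mathbb{E}_{\hat{P}_n}\!\left[e^{\ell(\pi, X)/\lambda}\right]
  + \lambda \rho .
```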
Abstract: A common task for recommender systems is to build a profile of a user's interests from the items in their browsing history and later to recommend items to the user from the same catalog. The user's behavior consists of two parts: the sequence of items that they viewed without intervention (the organic part) and the sequences of items recommended to them together with their outcomes (the bandit part). In this paper, we propose the Bayesian Latent Organic Bandit model (BLOB), a probabilistic approach that combines the organic and bandit signals in order to improve the estimation of recommendation quality. The bandit signal is valuable as it gives direct feedback on recommendation performance, but its quality is very uneven, as it is highly concentrated on the recommendations deemed optimal by the past version of the recommender system. In contrast, the organic signal is typically strong and covers most items, but it is not always relevant to the recommendation task. In order to leverage the organic signal to efficiently learn the bandit signal in a Bayesian model, we identify three fundamental types of distances, namely action-history, action-action, and history-history distances. We implement a scalable approximation of the full model using variational auto-encoders and the local re-parameterization trick. We show, using extensive simulation studies, that our method outperforms or matches the value of both state-of-the-art organic-based recommendation algorithms and bandit-based methods (both value-based and policy-based), in both organic-rich and bandit-rich environments.
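For reference, the local re-parameterization trick mentioned above (Kingma, Salimans and Welling, 2015) samples pre-activations rather than weights, which keeps the variational approximation scalable. Below is a minimal sketch for a Bayesian linear layer with a factorised Gaussian posterior; the names and shapes are illustrative and not taken from the paper.

```python
import numpy as np

def local_reparam_layer(x, w_mu, w_log_var, rng):
    """Local re-parameterization: for y = x @ W with W ~ N(w_mu, diag(var)),
    the pre-activations are Gaussian with mean x @ w_mu and variance
    (x**2) @ var, so we sample them directly instead of sampling W."""
    act_mu = x @ w_mu
    act_var = (x ** 2) @ np.exp(w_log_var)
    eps = rng.standard_normal(act_mu.shape)
    return act_mu + np.sqrt(act_var) * eps

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 20))            # a batch of 32 encoded user histories
w_mu = 0.1 * rng.normal(size=(20, 5))    # variational posterior means
w_log_var = np.full((20, 5), -4.0)       # variational posterior log-variances
print(local_reparam_layer(x, w_mu, w_log_var, rng).shape)   # (32, 5)
```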
Abstract: The combination of the re-parameterization trick with the use of variational auto-encoders has caused a sensation in Bayesian deep learning, allowing the training of realistic generative models of images and considerably increasing our ability to use scalable latent variable models. The re-parameterization trick is necessary for models in which no analytical variational bound is available, as it allows noisy gradients to be computed for arbitrary models. However, for certain standard output layers of a neural network, analytical bounds are available, and the variational auto-encoder may be used without the re-parameterization trick or the need for any Monte Carlo approximation. In this work, we show that, using the Jaakkola and Jordan bound, we can produce a binary classification layer that allows a Bayesian output layer to be trained using the standard stochastic gradient descent algorithm. We further demonstrate that a latent variable model utilizing the Bouchard bound for multi-class classification allows for fast training of a fully probabilistic latent factor model, even when the number of classes is very large.
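For reference, the Jaakkola and Jordan bound referred to above is the standard quadratic lower bound on the logistic sigmoid, which is what makes the expected binary log-likelihood under a Gaussian posterior available in closed form; ξ is a per-datapoint variational parameter that is tightened during training.

```latex
% Jaakkola–Jordan lower bound on the log-sigmoid: quadratic in x, so its
% expectation under a Gaussian q(x) is analytic and no sampling is needed.
\log \sigma(x) \;\ge\; \log \sigma(\xi) + \frac{x - \xi}{2}
  - \lambda(\xi)\left(x^{2} - \xi^{2}\right),
\qquad
\lambda(\xi) \;=\; \frac{1}{2\xi}\left(\sigma(\xi) - \tfrac{1}{2}\right)
  \;=\; \frac{\tanh(\xi/2)}{4\xi}.
```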