Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vianney Perchet

CREST, ENSAE Paris

Multi-Armed Bandits with Minimum Aggregated Revenue Constraints

Oct 14, 2025

Ahmed Ben Yahmed, Hafedh El Ferchichi, Marc Abeille, Vianney Perchet

Abstract:We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges -- particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.

Via

Access Paper or Ask Questions

Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Feb 08, 2025

Ziyad Benomar, Lorenzo Croissant, Vianney Perchet, Spyros Angelopoulos

Figure 1 for Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Figure 2 for Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Figure 3 for Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Figure 4 for Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Abstract:One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.

Via

Access Paper or Ask Questions

Strategic Multi-Armed Bandit Problems Under Debt-Free Reporting

Jan 27, 2025

Ahmed Ben Yahmed, Clément Calauzènes, Vianney Perchet

Figure 1 for Strategic Multi-Armed Bandit Problems Under Debt-Free Reporting

Figure 2 for Strategic Multi-Armed Bandit Problems Under Debt-Free Reporting

Figure 3 for Strategic Multi-Armed Bandit Problems Under Debt-Free Reporting

Abstract:We consider the classical multi-armed bandit problem, but with strategic arms. In this context, each arm is characterized by a bounded support reward distribution and strategically aims to maximize its own utility by potentially retaining a portion of its reward, and disclosing only a fraction of it to the learning agent. This scenario unfolds as a game over $T$ rounds, leading to a competition of objectives between the learning agent, aiming to minimize their regret, and the arms, motivated by the desire to maximize their individual utilities. To address these dynamics, we introduce a new mechanism that establishes an equilibrium wherein each arm behaves truthfully and discloses as much of its rewards as possible. With this mechanism, the agent can attain the second-highest average (true) reward among arms, with a cumulative regret bounded by $O(\log(T)/\Delta)$ (problem-dependent) or $O(\sqrt{T\log(T)})$ (worst-case).

Via

Access Paper or Ask Questions

Improved learning rates in multi-unit uniform price auctions

Jan 17, 2025

Marius Potfer, Dorian Baudry, Hugo Richard, Vianney Perchet, Cheng Wan

Figure 1 for Improved learning rates in multi-unit uniform price auctions

Figure 2 for Improved learning rates in multi-unit uniform price auctions

Abstract:Motivated by the strategic participation of electricity producers in electricity day-ahead market, we study the problem of online learning in repeated multi-unit uniform price auctions focusing on the adversarial opposing bid setting. The main contribution of this paper is the introduction of a new modeling of the bid space. Indeed, we prove that a learning algorithm leveraging the structure of this problem achieves a regret of $\tilde{O}(K^{4/3}T^{2/3})$ under bandit feedback, improving over the bound of $\tilde{O}(K^{7/4}T^{3/4})$ previously obtained in the literature. This improved regret rate is tight up to logarithmic terms. Inspired by electricity reserve markets, we further introduce a different feedback model under which all winning bids are revealed. This feedback interpolates between the full-information and bandit scenarios depending on the auctions' results. We prove that, under this feedback, the algorithm that we propose achieves regret $\tilde{O}(K^{5/2}\sqrt{T})$.

* NeurIPS 2024

Via

Access Paper or Ask Questions

Stable Matching with Ties: Approximation Ratios and Learning

Nov 05, 2024

Shiyun Lin, Simon Mauras, Nadav Merlis, Vianney Perchet

Abstract:We study the problem of matching markets with ties, where one side of the market does not necessarily have strict preferences over members at its other side. For example, workers do not always have strict preferences over jobs, students can give the same ranking for different schools and more. In particular, assume w.l.o.g. that workers' preferences are determined by their utility from being matched to each job, which might admit ties. Notably, in contrast to classical two-sided markets with strict preferences, there is no longer a single stable matching that simultaneously maximizes the utility for all workers. We aim to guarantee each worker the largest possible share from the utility in her best possible stable matching. We call the ratio between the worker's best possible stable utility and its assigned utility the \emph{Optimal Stable Share} (OSS)-ratio. We first prove that distributions over stable matchings cannot guarantee an OSS-ratio that is sublinear in the number of workers. Instead, randomizing over possibly non-stable matchings, we show how to achieve a tight logarithmic OSS-ratio. Then, we analyze the case where the real utility is not necessarily known and can only be approximated. In particular, we provide an algorithm that guarantees a similar fraction of the utility compared to the best possible utility. Finally, we move to a bandit setting, where we select a matching at each round and only observe the utilities for matches we perform. We show how to utilize our results for approximate utilities to gracefully interpolate between problems without ties and problems with statistical ties (small suboptimality gaps).

Via

Access Paper or Ask Questions

Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Aug 30, 2024

Ahmed Ben Yahmed, Clément Calauzènes, Vianney Perchet

Figure 1 for Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Figure 2 for Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Figure 3 for Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Figure 4 for Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Abstract:In the strategic multi-armed bandit setting, when arms possess perfect information about the player's behavior, they can establish an equilibrium where: 1. they retain almost all of their value, 2. they leave the player with a substantial (linear) regret. This study illustrates that, even if complete information is not publicly available to all arms but is shared among them, it is possible to achieve a similar equilibrium. The primary challenge lies in designing a communication protocol that incentivizes the arms to communicate truthfully.

* ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp.7435-7439

Via

Access Paper or Ask Questions

Improved Algorithms for Contextual Dynamic Pricing

Jun 17, 2024

Matilde Tullii, Solenne Gaucher, Nadav Merlis, Vianney Perchet

Figure 1 for Improved Algorithms for Contextual Dynamic Pricing

Abstract:In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further distorted by noise. Under minor regularity assumptions, our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results. The second model removes the linearity assumption, requiring only that the expected buyer valuation is $\beta$-H\"older in the context. For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{d+2\beta/d+3\beta})$, where $d$ is the dimension of the context space.

Via

Access Paper or Ask Questions

Non-clairvoyant Scheduling with Partial Predictions

May 02, 2024

Ziyad Benomar, Vianney Perchet

Abstract:The non-clairvoyant scheduling problem has gained new interest within learning-augmented algorithms, where the decision-maker is equipped with predictions without any quality guarantees. In practical settings, access to predictions may be reduced to specific instances, due to cost or data limitations. Our investigation focuses on scenarios where predictions for only $B$ job sizes out of $n$ are available to the algorithm. We first establish near-optimal lower bounds and algorithms in the case of perfect predictions. Subsequently, we present a learning-augmented algorithm satisfying the robustness, consistency, and smoothness criteria, and revealing a novel tradeoff between consistency and smoothness inherent in the scenario with a restricted number of predictions.

Via

Access Paper or Ask Questions

The Value of Reward Lookahead in Reinforcement Learning

Mar 18, 2024

Nadav Merlis, Dorian Baudry, Vianney Perchet

Abstract:In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.

Via

Access Paper or Ask Questions

Mode Estimation with Partial Feedback

Feb 20, 2024

Charles Arnal, Vivien Cabannes, Vianney Perchet

Figure 1 for Mode Estimation with Partial Feedback

Abstract:The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our problem.

Via

Access Paper or Ask Questions