Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wonyoung Kim

Adaptive Data Augmentation for Thompson Sampling

Jun 17, 2025

Wonyoung Kim

Abstract:In linear contextual bandits, the objective is to select actions that maximize cumulative rewards, modeled as a linear function with unknown parameters. Although Thompson Sampling performs well empirically, it does not achieve optimal regret bounds. This paper proposes a nearly minimax optimal Thompson Sampling for linear contextual bandits by developing a novel estimator with the adaptive augmentation and coupling of the hypothetical samples that are designed for efficient parameter learning. The proposed estimator accurately predicts rewards for all arms without relying on assumptions for the context distribution. Empirical results show robust performance and significant improvement over existing methods.

Via

Access Paper or Ask Questions

Linear Bandits with Partially Observable Features

Feb 10, 2025

Wonyoung Kim, Sungwoo Park, Garud Iyengar, Assaf Zeevi, Min-hwan Oh

Abstract:We introduce a novel linear bandit problem with partially observable features, resulting in partial reward information and spurious estimates. Without proper address for latent part, regret possibly grows linearly in decision horizon $T$, as their influence on rewards are unknown. To tackle this, we propose a novel analysis to handle the latent features and an algorithm that achieves sublinear regret. The core of our algorithm involves (i) augmenting basis vectors orthogonal to the observed feature space, and (ii) introducing an efficient doubly robust estimator. Our approach achieves a regret bound of $\tilde{O}(\sqrt{(d + d_h)T})$, where $d$ is the dimension of observed features, and $d_h$ is the unknown dimension of the subspace of the unobserved features. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and linear bandit algorithms depending solely on observed features.

Via

Access Paper or Ask Questions

A Doubly Robust Approach to Sparse Reinforcement Learning

Oct 23, 2023

Wonyoung Kim, Garud Iyengar, Assaf Zeevi

Figure 1 for A Doubly Robust Approach to Sparse Reinforcement Learning

Figure 2 for A Doubly Robust Approach to Sparse Reinforcement Learning

Figure 3 for A Doubly Robust Approach to Sparse Reinforcement Learning

Figure 4 for A Doubly Robust Approach to Sparse Reinforcement Learning

Abstract:We propose a new regret minimization algorithm for episodic sparse linear Markov decision process (SMDP) where the state-transition distribution is a linear function of observed features. The only previously known algorithm for SMDP requires the knowledge of the sparsity parameter and oracle access to an unknown policy. We overcome these limitations by combining the doubly robust method that allows one to use feature vectors of \emph{all} actions with a novel analysis technique that enables the algorithm to use data from all periods in all episodes. The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive the minimum eigenvalue of the average Gram matrix of feature vectors, $s_\star$ is the sparsity parameter, $H$ is the length of an episode, and $N$ is the number of rounds. We provide a lower regret bound that matches the upper bound up to logarithmic factors on a newly identified subclass of SMDPs. Our numerical experiments support our theoretical results and demonstrate the superior performance of our algorithm.

Via

Access Paper or Ask Questions

Pareto Front Identification with Regret Minimization

May 31, 2023

Wonyoung Kim, Garud Iyengar, Assaf Zeevi

Abstract:We consider Pareto front identification for linear bandits (PFILin) where the goal is to identify a set of arms whose reward vectors are not dominated by any of the others when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is $\tilde{O}(d/\Delta^2)$, where $d$ is the dimension of contexts and $\Delta$ is a measure of problem complexity. Our sample complexity is optimal up to a logarithmic factor. A novel feature of our algorithm is that it uses the contexts of all actions. In addition to efficiently identifying the Pareto front, our algorithm also guarantees $\tilde{O}(\sqrt{d/t})$ bound for instantaneous Pareto regret when the number of samples is larger than $\Omega(d\log dL)$ for $L$ dimensional vector rewards. By using the contexts of all arms, our proposed algorithm simultaneously provides efficient Pareto front identification and regret minimization. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while minimizing the regret.

* 25 pages including appendix

Via

Access Paper or Ask Questions

Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback

Jan 31, 2023

Wonyoung Kim, Garud Iyengar, Assaf Zeevi

Abstract:We consider the linear contextual multi-class multi-period packing problem~(LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.

* 42 pages including Appendix

Via

Access Paper or Ask Questions

Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits

Sep 15, 2022

Wonyoung Kim, Kyungbok Lee, Myunghee Cho Paik

Figure 1 for Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits

Figure 2 for Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits

Figure 3 for Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits

Figure 4 for Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits

Abstract:We propose a novel contextual bandit algorithm for generalized linear rewards with an $\tilde{O}(\sqrt{\kappa^{-1} \phi T})$ regret over $T$ rounds where $\phi$ is the minimum eigenvalue of the covariance of contexts and $\kappa$ is a lower bound of the variance of rewards. In several practical cases where $\phi=O(d)$, our result is the first regret bound for generalized linear model (GLM) bandits with the order $\sqrt{d}$ without relying on the approach of Auer [2002]. We achieve this bound using a novel estimator called double doubly-robust (DDR) estimator, a subclass of doubly-robust (DR) estimator but with a tighter error bound. The approach of Auer [2002] achieves independence by discarding the observed rewards, whereas our algorithm achieves independence considering all contexts using our DDR estimator. We also provide an $O(\kappa^{-1} \phi \log (NT) \log T)$ regret bound for $N$ arms under a probabilistic margin condition. Regret bounds under the margin condition are given by Bastani and Bayati [2020] and Bastani et al. [2021] under the setting that contexts are common to all arms but coefficients are arm-specific. When contexts are different for all arms but coefficients are common, ours is the first regret bound under the margin condition for linear models or GLMs. We conduct empirical studies using synthetic data and real examples, demonstrating the effectiveness of our algorithm.

* 33 pages including Appendix

Via

Access Paper or Ask Questions

Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits

Jun 16, 2022

Wonyoung Kim, Min-hwan Oh, Myunghee Cho Paik

Figure 1 for Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits

Figure 2 for Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits

Abstract:We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.

* 27 pages including Appendix

Via

Access Paper or Ask Questions

Doubly Robust Thompson Sampling for linear payoffs

Feb 01, 2021

Wonyoung Kim, Gi-soo Kim, Myunghee Cho Paik

Figure 1 for Doubly Robust Thompson Sampling for linear payoffs

Figure 2 for Doubly Robust Thompson Sampling for linear payoffs

Abstract:A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of other arms remain missing. Since the arm choice depends on the past context and reward pairs, the contexts of chosen arms suffer from correlation and render the analysis difficult. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling (TS) that applies the DR technique used in missing data literature to TS. The proposed algorithm improves the bound of TS by a factor of $\sqrt{d}$, where $d$ is the dimension of the context. A benefit of the proposed method is that it uses all the context data, chosen or not chosen, thus allowing to circumvent the technical definition of unsaturated arms used in theoretical analysis of TS. Empirical studies show the advantage of the proposed algorithm over TS.

* 18pages including Supplementary Materials

Via

Access Paper or Ask Questions

Principled learning method for Wasserstein distributionally robust optimization with local perturbations

Jun 22, 2020

Yongchan Kwon, Wonyoung Kim, Joong-Ho Won, Myunghee Cho Paik

Figure 1 for Principled learning method for Wasserstein distributionally robust optimization with local perturbations

Figure 2 for Principled learning method for Wasserstein distributionally robust optimization with local perturbations

Figure 3 for Principled learning method for Wasserstein distributionally robust optimization with local perturbations

Figure 4 for Principled learning method for Wasserstein distributionally robust optimization with local perturbations

Abstract:Wasserstein distributionally robust optimization (WDRO) attempts to learn a model that minimizes the local worst-case risk in the vicinity of the empirical data distribution defined by Wasserstein ball. While WDRO has received attention as a promising tool for inference since its introduction, its theoretical understanding has not been fully matured. Gao et al. (2017) proposed a minimizer based on a tractable approximation of the local worst-case risk, but without showing risk consistency. In this paper, we propose a minimizer based on a novel approximation theorem and provide the corresponding risk consistency results. Furthermore, we develop WDRO inference for locally perturbed data that include the Mixup (Zhang et al., 2017) as a special case. We show that our approximation and risk consistency results naturally extend to the cases when data are locally perturbed. Numerical experiments demonstrate robustness of the proposed method using image classification datasets. Our results show that the proposed method achieves significantly higher accuracy than baseline models on noisy datasets.

* Accepted for ICML 2020

Via

Access Paper or Ask Questions

An analytic formulation for positive-unlabeled learning via weighted integral probability metric

Feb 08, 2019

Yongchan Kwon, Wonyoung Kim, Masashi Sugiyama, Myunghee Cho Paik

Figure 1 for An analytic formulation for positive-unlabeled learning via weighted integral probability metric

Figure 2 for An analytic formulation for positive-unlabeled learning via weighted integral probability metric

Figure 3 for An analytic formulation for positive-unlabeled learning via weighted integral probability metric

Figure 4 for An analytic formulation for positive-unlabeled learning via weighted integral probability metric

Abstract:We consider the problem of learning a binary classifier from only positive and unlabeled observations (PU learning). Although recent research in PU learning has succeeded in showing theoretical and empirical performance, most existing algorithms need to solve either a convex or a non-convex optimization problem and thus are not suitable for large-scale datasets. In this paper, we propose a simple yet theoretically grounded PU learning algorithm by extending the previous work proposed for supervised binary classification (Sriperumbudur et al., 2012). The proposed PU learning algorithm produces a closed-form classifier when the hypothesis space is a closed ball in reproducing kernel Hilbert space. In addition, we establish upper bounds of the estimation error and the excess risk. The obtained estimation error bound is sharper than existing results and the excess risk bound does not rely on an approximation error term. To the best of our knowledge, we are the first to explicitly derive the excess risk bound in the field of PU learning. Finally, we conduct extensive numerical experiments using both synthetic and real datasets, demonstrating improved accuracy, scalability, and robustness of the proposed algorithm.

* 37 pages, 10 figures

Via

Access Paper or Ask Questions