University of Texas
Abstract:Bandit algorithms have garnered significant attention due to their practical applications in real-world scenarios. However, beyond simple settings such as multi-arm or linear bandits, optimal algorithms remain scarce. Notably, no optimal solution exists for pure exploration problems in the context of generalized linear model (GLM) bandits. In this paper, we narrow this gap and develop the first track-and-stop algorithm for general pure exploration problems under the logistic bandit called logistic track-and-stop (Log-TS). Log-TS is an efficient algorithm that asymptotically matches an approximation for the instance-specific lower bound of the expected sample complexity up to a logarithmic factor.
Abstract:We investigate active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. Thus, with sufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast error convergence rates can be achieved. This contrasts with the passive (i.i.d.) data collection strategies, where the convergence rate is never faster than $\sim n^{-1}$. In fact, for our setting, we establish a \emph{non-vanishing} lower bound for any passive data collection strategy, regardless of the eigenvalues decay rate of the covariance kernel. Overall, our results show the benefit of active over passive data collection strategies in operator learning.
Abstract:We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with stochastic time delays and associated costs. In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time. The problem is framed as a special case of semi-Markov decision processes (SMDPs). The arm contexts, request times, and costs are assumed to follow an unknown distribution. We consider the regret of an online learning algorithm with respect to the optimal policy that achieves the maximum average reward. By leveraging the Bellman optimality equation, we design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret. Under the realizability assumption, we analyze the proposed algorithms and demonstrate that their regret upper bounds align with established results in the contextual bandit literature. We validate the algorithms through experiments on simulated data and a movie recommendation dataset, showing that their performance is consistent with theoretical analyses.
Abstract:We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold [1967], Angluin [1979, 1980], and Kleinberg and Mullainathan [2024] for language identification/generation in the limit in terms of a binary hypothesis class defined over an abstract instance space. Then, we formalize a different paradigm of generation studied by Kleinberg and Mullainathan [2024], which we call ``uniform generation," and provide a characterization of which hypothesis classes are uniformly generatable. As is standard in statistical learning theory, our characterization is in terms of the finiteness of a new combinatorial dimension we call the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are \emph{incompatible} - there are classes that are generatable but not predictable and vice versa.
Abstract:We investigate the problem of learning operators between function spaces, focusing on the linear layer of the Fourier Neural Operator. First, we identify three main errors that occur during the learning process: statistical error due to finite sample size, truncation error from finite rank approximation of the operator, and discretization error from handling functional data on a finite grid of domain points. Finally, we analyze a Discrete Fourier Transform (DFT) based least squares estimator, establishing both upper and lower bounds on the aforementioned errors.
Abstract:Sequential decision-making domains such as recommender systems, healthcare and education often have unobserved heterogeneity in the population that can be modeled using latent bandits $-$ a framework where an unobserved latent state determines the model for a trajectory. While the latent bandit framework is compelling, the extent of its generality is unclear. We first address this by establishing a de Finetti theorem for decision processes, and show that $\textit{every}$ exchangeable and coherent stateless decision process is a latent bandit. The latent bandit framework lends itself particularly well to online learning with offline datasets, a problem of growing interest in sequential decision-making. One can leverage offline latent bandit data to learn a complex model for each latent state, so that an agent can simply learn the latent state online to act optimally. We focus on a linear model for a latent bandit with $d_A$-dimensional actions, where the latent states lie in an unknown $d_K$-dimensional subspace for $d_K \ll d_A$. We present SOLD, a novel principled method to learn this subspace from short offline trajectories with guarantees. We then provide two methods to leverage this subspace online: LOCAL-UCB and ProBALL-UCB. We demonstrate that LOCAL-UCB enjoys $\tilde O(\min(d_A\sqrt{T}, d_K\sqrt{T}(1+\sqrt{d_AT/d_KN})))$ regret guarantees, where the effective dimension is lower when the size $N$ of the offline dataset is larger. ProBALL-UCB enjoys a slightly weaker guarantee, but is more practical and computationally efficient. Finally, we establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens.
Abstract:Uncertainty quantification for multi-view learning is motivated by the increasing use of multi-view data in scientific problems. A common variant of multi-view learning is late fusion: train separate predictors on individual views and combine them after single-view predictions are available. Existing methods for uncertainty quantification for late fusion often rely on undesirable distributional assumptions for validity. Conformal prediction is one approach that avoids such distributional assumptions. However, naively applying conformal prediction to late-stage fusion pipelines often produces overly conservative and uninformative prediction regions, limiting its downstream utility. We propose a novel methodology, Multi-View Conformal Prediction (MVCP), where conformal prediction is instead performed separately on the single-view predictors and only fused subsequently. Our framework extends the standard scalar formulation of a score function to a multivariate score that produces more efficient downstream prediction regions in both classification and regression settings. We then demonstrate that such improvements can be realized in methods built atop conformalized regressors, specifically in robust predict-then-optimize pipelines.
Abstract:We study online classification under smoothed adversaries. In this setting, at each time point, the adversary draws an example from a distribution that has a bounded density with respect to a fixed base measure, which is known apriori to the learner. For binary classification and scalar-valued regression, previous works \citep{haghtalab2020smoothed, block2022smoothed} have shown that smoothed online learning is as easy as learning in the iid batch setting under PAC model. However, we show that smoothed online classification can be harder than the iid batch classification when the label space is unbounded. In particular, we construct a hypothesis class that is learnable in the iid batch setting under the PAC model but is not learnable under the smoothed online model. Finally, we identify a condition that ensures that the PAC learnability of a hypothesis class is sufficient for its smoothed online learnability.
Abstract:We study online classification when the learner has access to predictions about future examples. We design an online learner whose expected regret is never worse than the worst-case regret, gracefully improves with the quality of the predictions, and can be significantly better than the worst-case regret when the predictions of future examples are accurate. As a corollary, we show that if the learner is always guaranteed to observe data where future examples are easily predictable, then online learning can be as easy as transductive online learning. Our results complement recent work in online algorithms with predictions and smoothed online classification, which go beyond a worse-case analysis by using machine-learned predictions and distributional assumptions respectively.
Abstract:Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for its wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on the improved statistical efficiency by assuming a shared structure across tasks, exploration--a crucial aspect of RL--has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design like $\epsilon$-greedy that are inefficient in general can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample-efficiency.