Abstract:This study introduces a debiasing method for regression estimators, including high-dimensional and nonparametric regression estimators. For example, nonparametric regression methods allow for the estimation of regression functions in a data-driven manner with minimal assumptions; however, these methods typically fail to achieve $\sqrt{n}$-consistency in their convergence rates, and many, including those in machine learning, lack guarantees that their estimators asymptotically follow a normal distribution. To address these challenges, we propose a debiasing technique for nonparametric estimators by adding a bias-correction term to the original estimators, extending the conventional one-step estimator used in semiparametric analysis. Specifically, for each data point, we estimate the conditional expected residual of the original nonparametric estimator, which can, for instance, be computed using kernel (Nadaraya-Watson) regression, and incorporate it as a bias-reduction term. Our theoretical analysis demonstrates that the proposed estimator achieves $\sqrt{n}$-consistency and asymptotic normality under a mild convergence rate condition for both the original nonparametric estimator and the conditional expected residual estimator. Notably, this approach remains model-free as long as the original estimator and the conditional expected residual estimator satisfy the convergence rate condition. The proposed method offers several advantages, including improved estimation accuracy and simplified construction of confidence intervals.
Abstract:This study introduces a doubly robust (DR) estimator for regression discontinuity (RD) designs. In RD designs, treatment effects are estimated in a quasi-experimental setting where treatment assignment depends on whether a running variable surpasses a predefined cutoff. A common approach in RD estimation is to apply nonparametric regression methods, such as local linear regression. In such an approach, the validity relies heavily on the consistency of nonparametric estimators and is limited by the nonparametric convergence rate, thereby preventing $\sqrt{n}$-consistency. To address these issues, we propose the DR-RD estimator, which combines two distinct estimators for the conditional expected outcomes. If either of these estimators is consistent, the treatment effect estimator remains consistent. Furthermore, due to the debiasing effect, our proposed estimator achieves $\sqrt{n}$-consistency if both regression estimators satisfy certain mild conditions, which also simplifies statistical inference.
Abstract:This study explores portfolio selection using predictive models for portfolio returns. Portfolio selection is a fundamental task in finance, and various methods have been developed to achieve this goal. For example, the mean-variance approach constructs portfolios by balancing the trade-off between the mean and variance of asset returns, while the quantile-based approach optimizes portfolios by accounting for tail risk. These traditional methods often rely on distributional information estimated from historical data. However, a key concern is the uncertainty of future portfolio returns, which may not be fully captured by simple reliance on historical data, such as using the sample average. To address this, we propose a framework for predictive portfolio selection using conformal inference, called Conformal Predictive Portfolio Selection (CPPS). Our approach predicts future portfolio returns, computes corresponding prediction intervals, and selects the desirable portfolio based on these intervals. The framework is flexible and can accommodate a variety of predictive models, including autoregressive (AR) models, random forests, and neural networks. We demonstrate the effectiveness of our CPPS framework using an AR model and validate its performance through empirical studies, showing that it provides superior returns compared to simpler strategies.
Abstract:This study investigates a local asymptotic minimax optimal strategy for fixed-budget best arm identification (BAI). We propose the Adaptive Generalized Neyman Allocation (AGNA) strategy and show that its worst-case upper bound of the probability of misidentifying the best arm aligns with the worst-case lower bound under the small-gap regime, where the gap between the expected outcomes of the best and suboptimal arms is small. Our strategy corresponds to a generalization of the Neyman allocation for two-armed bandits (Neyman, 1934; Kaufmann et al., 2016) and a refinement of existing strategies such as the ones proposed by Glynn & Juneja (2004) and Shin et al. (2018). Compared to Komiyama et al. (2022), which proposes a minimax rate-optimal strategy, our proposed strategy has a tighter upper bound that exactly matches the lower bound, including the constant terms, by restricting the class of distributions to the class of small-gap distributions. Our result contributes to the longstanding open issue about the existence of asymptotically optimal strategies in fixed-budget BAI, by presenting the local asymptotic minimax optimal strategy.
Abstract:This study designs an adaptive experiment for efficiently estimating average treatment effect (ATEs). We consider an adaptive experiment where an experimenter sequentially samples an experimental unit from a covariate density decided by the experimenter and assigns a treatment. After assigning a treatment, the experimenter observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using gathered samples. The objective of the experimenter is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose a framework under which an experimenter optimizes the covariate density, as well as the propensity score, and find that optimizing both covariate density and propensity score reduces the asymptotic variance more than optimizing only the propensity score. Based on this idea, in each round of our experiment, the experimenter optimizes the covariate density and propensity score based on past observations. To design an adaptive experiment, we first derive the efficient covariate density and propensity score that minimizes the semiparametric efficiency bound, a lower bound for the asymptotic variance given a fixed covariate density and a fixed propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound.
Abstract:This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts. In this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regrets satisfy $O(\log^2(T))$ for the number of rounds $T$ in a stochastic regime with a suboptimality gap lower-bounded by a positive constant, while satisfying $O(\sqrt{T})$ in an adversarial regime. However, the dependency on $T$ has room for improvement, and the suboptimality-gap assumption can be relaxed. For this issue, this study proposes an algorithm whose regret satisfies $O(\log(T))$ in the setting when the suboptimality gap is lower-bounded. Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap. That condition characterizes the problem difficulty linked to the suboptimality gap using a parameter $\beta \in (0, \infty]$. We then show that the algorithm's regret satisfies $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$. Here, $\beta= \infty$ corresponds to the case in the existing studies where a lower bound exists in the suboptimality gap, and our regret satisfies $O(\log(T))$ in that case. Our proposed algorithm is based on the Follow-The-Regularized-Leader with the Tsallis entropy and referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
Abstract:This study investigates the estimation and the statistical inference about Conditional Average Treatment Effects (CATEs), which have garnered attention as a metric representing individualized causal effects. In our data-generating process, we assume linear models for the outcomes associated with binary treatments and define the CATE as a difference between the expected outcomes of these linear models. This study allows the linear models to be high-dimensional, and our interest lies in consistent estimation and statistical inference for the CATE. In high-dimensional linear regression, one typical approach is to assume sparsity. However, in our study, we do not assume sparsity directly. Instead, we consider sparsity only in the difference of the linear models. We first use a doubly robust estimator to approximate this difference and then regress the difference on covariates with Lasso regularization. Although this regression estimator is consistent for the CATE, we further reduce the bias using the techniques in double/debiased machine learning (DML) and debiased Lasso, leading to $\sqrt{n}$-consistency and confidence intervals. We refer to the debiased estimator as the triple/debiased Lasso (TDL), applying both DML and debiased Lasso techniques. We confirm the soundness of our proposed method through simulation studies.
Abstract:We study a primal-dual reinforcement learning (RL) algorithm for the online constrained Markov decision processes (CMDP) problem, wherein the agent explores an optimal policy that maximizes return while satisfying constraints. Despite its widespread practical use, the existing theoretical literature on primal-dual RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient primal-dual algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while an existing algorithm exhibits oscillatory performance and constraint violation.
Abstract:Evidence-based targeting has been a topic of growing interest among the practitioners of policy and business. Formulating decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study an optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner assigns treatment arms adaptively over sequentially arriving experimental units upon observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule to the population. Setting the worst-case expected regret as the performance criterion of adaptive sampling and recommended policies, we derive its asymptotic lower bounds, and propose a strategy, Adaptive Sampling-Policy Learning strategy (PLAS), whose leading factor of the regret upper bound aligns with the lower bound as the size of experimental units increases.
Abstract:This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under an adversarial corruption. At each round, a decision-maker observes an independent and identically distributed context and then selects an arm based on the context and past observations. After selecting an arm, the decision-maker incurs a loss corresponding to the selected arm. The decision-maker aims to minimize the cumulative loss over the trial. The goal of this study is to develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees. We first formulate the problem by introducing a novel setting of bandits with adversarial corruption, referred to as the contextual adversarial regime with a self-bounding constraint. We assume linear models for the relationship between the loss and the context. Then, we propose a strategy that extends the RealLinExp3 by Neu & Olkhovskaya (2020) and the Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}} + \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \ \sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and suboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption parameter. This regret upper bound implies $O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and by $O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both stochastic and adversarial regimes.