Abstract:We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop a test statistic that is asymptotically normal, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape.
Abstract:Online nonparametric estimators are gaining popularity due to their efficient computation and competitive generalization abilities. An important example includes variants of stochastic gradient descent. These algorithms often take one sample point at a time and instantly update the parameter estimate of interest. In this work we consider model selection and hyperparameter tuning for such online algorithms. We propose a weighted rolling-validation procedure, an online variant of leave-one-out cross-validation, that costs minimal extra computation for many typical stochastic gradient descent estimators. Similar to batch cross-validation, it can boost base estimators to achieve a better, adaptive convergence rate. Our theoretical analysis is straightforward, relying mainly on some general statistical stability assumptions. The simulation study underscores the significance of diverging weights in rolling validation in practice and demonstrates its sensitivity even when there is only a slim difference between candidate estimators.
Abstract:Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. Here we consider estimating which data values are incorrect along a numerical column. We present a model-agnostic approach that can utilize any regressor (i.e. statistical or machine learning model) which was fit to predict values in this column based on the other variables in the dataset. By accounting for various uncertainties, our approach distinguishes between genuine anomalies and natural data fluctuations, conditioned on the available information in the dataset. We establish theoretical guarantees for our method and show that other approaches like conformal inference struggle to detect errors. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
Abstract:This work presents a novel K-Repetition based HARQ scheme for LDPC coded uplink SCMA by employing a network coding (NC) principle to encode different packets, where K-Repetition is an emerging technique (recommended in 3GPP Release 15) for enhanced reliability and reduced latency in future massive machine-type communication. Such a scheme is referred to as the NC aided K-repetition SCMA (NCK-SCMA). We introduce a joint iterative detection algorithm for improved detection of the data from the proposed LDPC coded NCKSCMA systems. Simulation results demonstrate the benefits of NCK-SCMA with higher throughput and improved reliability over the conventional K-Repetition SCMA.
Abstract:Grant-free sparse code multiple access (GF-SCMA) is considered to be a promising multiple access candidate for future wireless networks. In this paper, we focus on characterizing the performance of uplink GF-SCMA schemes in a network with ubiquitous connections, such as the Internet of Things (IoT) networks. To provide a tractable approach to evaluate the performance of GF-SCMA, we first develop a theoretical model taking into account the property of multi-user detection (MUD) in the SCMA system. We then analyze the error rate performance of GF-SCMA in the case of codebook collision to investigate the reliability of GF-SCMA when reusing codebook in massive IoT networks. For performance evaluation, accurate approximations for both success probability and average symbol error probability (ASEP) are derived. To elaborate further, we utilize the analytical results to discuss the impact of codeword sparse degree in GFSCMA. After that, we conduct a comparative study between SCMA and its variant, dense code multiple access (DCMA), with GF transmission to offer insights into the effectiveness of these two schemes. This facilitates the GF-SCMA system design in practical implementation. Simulation results show that denser codebooks can help to support more UEs and increase the reliability of data transmission in a GF-SCMA network. Moreover, a higher success probability can be achieved by GFSCMA with denser UE deployment at low detection thresholds since SCMA can achieve overloading gain.
Abstract:Sparse principal component analysis (PCA) is an important technique for dimensionality reduction of high-dimensional data. However, most existing sparse PCA algorithms are based on non-convex optimization, which provide little guarantee on the global convergence. Sparse PCA algorithms based on a convex formulation, for example the Fantope projection and selection (FPS), overcome this difficulty, but are computationally expensive. In this work we study sparse PCA based on the convex FPS formulation, and propose a new algorithm that is computationally efficient and applicable to large and high-dimensional data sets. Nonasymptotic and explicit bounds are derived for both the optimization error and the statistical accuracy, which can be used for testing and inference problems. We also extend our algorithm to online learning problems, where data are obtained in a streaming fashion. The proposed algorithm is applied to high-dimensional gene expression data for the detection of functional gene groups.
Abstract:We provide upper bounds of the expected Wasserstein distance between a probability measure and its empirical version, generalizing recent results for finite dimensional Euclidean spaces and bounded functional spaces. Such a generalization can cover Euclidean spaces with large dimensionality, with the optimal dependence on the dimensionality. Our method also covers the important case of Gaussian processes in separable Hilbert spaces, with rate-optimal upper bounds for functional data distributions whose coordinates decay geometrically or polynomially. Moreover, our bounds of the expected value can be combined with mean-concentration results to yield improved exponential tail probability bounds for the Wasserstein error of empirical measures under a Bernstein-type tail condition.
Abstract:Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross validation methods tend to select overfitting models, due to the ignorance of the uncertainty in the testing sample. We develop a new, statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This new method outputs a set of highly competitive candidate models containing the best one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for regularizing tuning parameter selection, the method can provide a further trade-off between prediction accuracy and model interpretability. We demonstrate the performance of the proposed method in several simulated and real data examples.
Abstract:We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called {\it rank-one-out} conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, in order to adapt to heteroskedascity in the data. Finally, we propose a model-free notion of variable importance, called {\it leave-one-covariate-out} or LOCO inference. Accompanying this paper is an R package {\tt conformalInference} that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
Abstract:In most classification tasks there are observations that are ambiguous and therefore difficult to correctly label. Set-valued classification allows the classifiers to output a set of plausible labels rather than a single label, thereby giving a more appropriate and informative treatment to the labeling of ambiguous instances. We introduce a framework for multiclass set-valued classification, where the classifiers guarantee user-defined levels of coverage or confidence (the probability that the true label is contained in the set) while minimizing the ambiguity (the expected size of the output). We first derive oracle classifiers assuming the true distribution to be known. We show that the oracle classifiers are obtained from level sets of the functions that define the conditional probability of each class. Then we develop estimators with good asymptotic and finite sample properties. The proposed classifiers build on and refine many existing single-label classifiers. The optimal classifier can sometimes output the empty set. We provide two solutions to fix this issue that are suitable for various practical needs.