Abstract: We study the problem of estimating a function in the presence of both speckle and additive noise. Although additive noise has been thoroughly explored in nonparametric estimation, speckle noise, which is prevalent in applications such as synthetic aperture radar, ultrasound imaging, and digital holography, has received far less attention. Consequently, theoretical investigations into the fundamental limits of mitigating speckle noise are lacking. This paper takes a first step toward filling this gap. Our focus is on the minimax error for estimating a $\beta$-H\"older continuous function and on determining the rate of the minimax risk. Specifically, if $n$ denotes the number of data points, $f$ the underlying function to be estimated, and $\hat{\nu}_n$ an estimate of $f$, then $\inf_{\hat{\nu}_n} \sup_f \mathbb{E}_f\| \hat{\nu}_n - f \|^2_2$ decays at the rate $n^{-\frac{2\beta}{2\beta+1}}$. Interestingly, this rate is identical to the one achieved for mitigating additive noise whose variance is $\Theta(1)$. To validate our minimax upper bounds, we implement the minimax optimal algorithms on simulated data and use Monte Carlo simulations to characterize their exact risk. The decay rates observed in our simulations closely match those predicted by the theory.
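As an illustration of this kind of simulation, here is a minimal Monte Carlo sketch under an assumed fully developed speckle model $y_i = f(x_i)\,s_i + \sigma z_i$ with unit-mean exponential speckle $s_i$; the observation model, test function, and moving-average estimator are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(n, beta=1.0, sigma=0.5, reps=200):
    """Monte Carlo MSE of a moving-average estimator under speckle noise."""
    x = np.linspace(0, 1, n)
    f = 1.0 + 0.5 * np.sin(2 * np.pi * x)    # smooth (hence beta-Holder) truth
    # window of ~n^{2b/(2b+1)} points, matching the minimax-rate bandwidth
    k = max(1, int(n ** (2 * beta / (2 * beta + 1)) / 2))
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    mses = []
    for _ in range(reps):
        # unit-mean exponential speckle keeps the local average unbiased
        y = f * rng.exponential(1.0, n) + sigma * rng.standard_normal(n)
        fhat = np.convolve(y, kernel, mode="same")
        mses.append(np.mean((fhat - f)[k:-k] ** 2))  # trim convolution edges
    return np.mean(mses)

for n in [250, 1000, 4000]:
    # MSE should decay roughly like n**(-2*beta/(2*beta+1))
    print(n, mc_risk(n))
```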
Abstract: We investigate both the theoretical and algorithmic aspects of likelihood-based methods for recovering a complex-valued signal from multiple sets of measurements, referred to as looks, affected by speckle (multiplicative) noise. On the theoretical side, we establish the first upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our results capture the dependence of the MSE on the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we introduce the concept of bagged Deep Image Priors (Bagged-DIP) and integrate them with projected gradient descent (PGD). Furthermore, we show how employing the Newton-Schulz algorithm for the matrix inverses required within the PGD iterations reduces the computational complexity of the algorithm. We show that this method achieves state-of-the-art performance.
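For concreteness, a minimal numpy sketch of the Newton-Schulz iteration referenced above; the initialization and iteration count are common textbook choices, not necessarily those used in the paper.

```python
import numpy as np

def newton_schulz_inverse(A, iters=30):
    """Approximate A^{-1} using only matrix multiplications.

    Iterates X <- X (2I - A X); converges quadratically whenever the
    spectral radius of (I - A X_0) is below one.
    """
    n = A.shape[0]
    # Standard safe initialization: X0 = A^T / (||A||_1 * ||A||_inf)
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(iters):
        X = X @ (2 * I - A @ X)
    return X

A = np.random.default_rng(1).standard_normal((50, 50)) + 10 * np.eye(50)
print(np.linalg.norm(newton_schulz_inverse(A) @ A - np.eye(50)))  # ~0
```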
Abstract: Despite a large and significant body of recent work focused on estimating the out-of-sample risk of regularized models in the high-dimensional regime, a theoretical understanding of this problem for non-differentiable penalties, such as the generalized LASSO and the nuclear norm, has been missing. In this paper we resolve this challenge. We study the problem in the proportional high-dimensional regime where both the sample size $n$ and the number of features $p$ are large, while $n/p$ and the signal-to-noise ratio (per observation) remain finite. We provide finite-sample upper bounds on the expected squared error of leave-one-out cross-validation (LO) in estimating the out-of-sample risk. The theoretical framework presented here provides a solid foundation for explaining the empirical findings on the accuracy of LO.
Abstract: The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross-validation (LO) offers a (nearly) distribution-free yet computationally demanding approach for estimating OO. Recent theoretical work showed that approximate leave-one-out cross-validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, a theoretical understanding of ALO's error has been missing. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error $|{\rm ALO} - {\rm LO}|$ in terms of intuitive metrics, such as the size of the leave-$i$-out perturbations of the active sets, the sample size $n$, the number of features $p$, and the regularization parameters. As a consequence, for $\ell_1$-regularized problems we show that $|{\rm ALO} - {\rm LO}|$ goes to zero as $p$ goes to infinity while $n/p$ and the signal-to-noise ratio (SNR) remain fixed and bounded.
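To illustrate, here is a minimal sketch of an active-set ALO computation for the $\ell_1$-regularized least-squares (lasso) special case: the active set of the full-data fit is held fixed, and the classical leave-one-out shortcut for linear smoothers is applied to the active columns. The data and the scikit-learn solver are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 600
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:20] = 1.0
y = X @ beta + rng.standard_normal(n)

fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
S = np.flatnonzero(fit.coef_)                # active set of the full-data fit
XS = X[:, S]
# Leverage scores of the linear smoother restricted to the active columns
H = XS @ np.linalg.solve(XS.T @ XS, XS.T)
h = np.diag(H)
resid = y - fit.predict(X)
alo_risk = np.mean((resid / (1 - h)) ** 2)   # ALO estimate of out-of-sample MSE
print(alo_risk)
```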
Abstract: We consider a nonlinear inverse problem $\mathbf{y}= f(\mathbf{Ax})$, where the observations $\mathbf{y} \in \mathbb{R}^m$ are a componentwise nonlinear transformation of $\mathbf{Ax} \in \mathbb{R}^m$, $\mathbf{x} \in \mathbb{R}^n$ is the signal of interest, and $\mathbf{A}$ is a known linear mapping. By properly specifying the nonlinear processing function, this model can be particularized to many signal processing problems, including compressed sensing and phase retrieval. Our main goal in this paper is to understand the impact of sensing matrices, or more specifically the spectra of sensing matrices, on the difficulty of recovering $\mathbf{x}$ from $\mathbf{y}$. Towards this goal, we study the performance of one of the most successful recovery methods, the expectation propagation (EP) algorithm. We define a notion of spikiness for the spectrum of $\mathbf{A}$ and show the importance of this measure for the performance of EP. Whether the spikiness of the spectrum hurts or helps the recovery performance of EP depends on $f$. We define certain quantities based on the function $f$ that enable us to describe the impact of the spikiness of the spectrum on EP recovery. Based on our framework, we show that, for instance, in phase retrieval problems matrices with spikier spectra are better for EP, while in 1-bit compressed sensing problems less spiky (flatter) spectra offer better recoveries. Our results unify and substantially generalize the existing results comparing sub-Gaussian and orthogonal matrices, and provide a platform for designing optimal sensing systems.
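To make the role of the spectrum concrete, one can synthesize sensing matrices with a prescribed spectrum as $\mathbf{A} = \mathbf{U}\,\mathrm{diag}(\mathbf{s})\,\mathbf{V}^\top$ with Haar-distributed orthogonal factors. The sketch below contrasts a flat spectrum with a spiky one; the moment-ratio "proxy" is only an illustrative measure of spikiness, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_with_spectrum(s, m, n):
    """Build A = U diag(s) V^T with Haar-distributed orthogonal factors."""
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    S = np.zeros((m, n)); S[:len(s), :len(s)] = np.diag(s)
    return U @ S @ V.T

m = n = 200
flat  = np.ones(n)                                         # orthogonal-like
spiky = np.r_[10 * np.ones(n // 10), 0.1 * np.ones(n - n // 10)]

for name, s in [("flat", flat), ("spiky", spiky)]:
    A = matrix_with_spectrum(s, m, n)
    sv2 = np.linalg.svd(A, compute_uv=False) ** 2
    proxy = np.mean(sv2 ** 2) / np.mean(sv2) ** 2  # illustrative spikiness proxy
    print(name, round(proxy, 2))
```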
Abstract: We obtain concentration and large deviation results for sums of independent and identically distributed random variables with heavy-tailed distributions. Our concentration results concern random variables whose distributions satisfy $P(X>t) \leq {\rm e}^{- I(t)}$, where $I: \mathbb{R} \rightarrow \mathbb{R}$ is an increasing function with $I(t)/t \rightarrow \alpha \in [0, \infty)$ as $t \rightarrow \infty$. Our main theorem not only recovers some existing results, such as the concentration of sums of sub-Weibull random variables, but also produces new results for sums of random variables with heavier tails. We show that the concentration inequalities we obtain are sharp enough to also yield large deviation results for sums of independent random variables. Our analyses, which are based on standard truncation arguments, simplify, unify, and generalize the existing results on the concentration and large deviation of heavy-tailed random variables.
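In the heavy-tailed case ($\alpha = 0$), the classical single-big-jump picture predicts $P(S_n - n\mu > t) \approx n\, P(X_1 - \mu > t)$ for large $t$, i.e. the sum's tail is driven by one large summand. A rough Monte Carlo check with Weibull-tailed variables, for which $P(X > t) = e^{-\sqrt{t}}$, so $I(t) = \sqrt{t}$ and $I(t)/t \rightarrow 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 1_000_000
X = rng.weibull(0.5, size=(reps, n))   # P(X > t) = exp(-sqrt(t))
mu = 2.0                               # mean = Gamma(1 + 1/0.5) = Gamma(3) = 2
S = X.sum(axis=1) - n * mu

for t in [25.0, 50.0, 100.0]:
    lhs = np.mean(S > t)                     # P(S_n - n*mu > t)
    rhs = n * np.mean(X[:, 0] - mu > t)      # n * P(X_1 - mu > t)
    print(t, lhs, rhs)                       # ratio approaches 1 as t grows
```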
Abstract: We study the problem of out-of-sample risk estimation in the high-dimensional regime where both the sample size $n$ and the number of features $p$ are large, and $n/p$ can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross-validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite-sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as $n, p \rightarrow \infty$, even when the dimension $p$ of the feature vectors is comparable with or greater than the sample size $n$. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
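As a concrete instance of the quantity being analyzed, here is a minimal brute-force LO sketch for one member of the penalized generalized linear family, $\ell_2$-regularized logistic regression. The data, penalty level, and scikit-learn solver are illustrative, and the $n$ refits are exactly the cost that motivates scalable approximate LO.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 150                        # n/p < 1, as in the regime studied
X = rng.standard_normal((n, p))
w = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w))).astype(int)

def neg_log_lik(model, Xi, yi):
    pr = model.predict_proba(Xi)[:, 1]
    return -(yi * np.log(pr) + (1 - yi) * np.log(1 - pr))

# LO: refit n times, each time scoring the single held-out observation
lo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    m = LogisticRegression(C=1.0, fit_intercept=False).fit(X[keep], y[keep])
    lo += neg_log_lik(m, X[i:i + 1], y[i:i + 1])[0]
print("LO risk estimate:", lo / n)
```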
Abstract: The recently proposed SLOPE estimator (arXiv:1407.3824) has been shown to adaptively achieve the minimax $\ell_2$ estimation rate under high-dimensional sparse linear regression models (arXiv:1503.08393). Such minimax optimality holds in the regime where the sparsity level $k$, the sample size $n$, and the dimension $p$ satisfy $k/p \rightarrow 0$ and $k\log p/n \rightarrow 0$. In this paper, we characterize the estimation error of SLOPE in the complementary regime where both $k$ and $n$ scale linearly with $p$, and provide new insights into the performance of SLOPE estimators. We first derive a concentration inequality for the finite-sample mean squared error (MSE) of SLOPE. The quantity that the MSE concentrates around takes a complicated and implicit form. Through a delicate analysis of this quantity, we prove that, among all SLOPE estimators, LASSO is optimal for estimating $k$-sparse parameter vectors without tied non-zero components in the low-noise scenario. In the large-noise scenario, on the other hand, the family of SLOPE estimators is sub-optimal compared with bridge regression, e.g., the ridge estimator.
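For reference, SLOPE solves $\min_{\mathbf{b}} \frac{1}{2}\|\mathbf{y}-\mathbf{X}\mathbf{b}\|_2^2 + \sum_{i} \lambda_i |b|_{(i)}$ with nonincreasing weights $\lambda_1 \geq \cdots \geq \lambda_p \geq 0$; LASSO is the constant-$\lambda$ special case discussed above. A minimal sketch of the SLOPE proximal operator via pool-adjacent-violators, the standard building block for computing SLOPE with proximal gradient methods:

```python
import numpy as np

def prox_sorted_l1(v, lam):
    """prox of the sorted-L1 penalty: argmin_x 0.5||x-v||^2 + sum lam_i |x|_(i).

    lam must be nonincreasing and nonnegative.
    """
    sign, av = np.sign(v), np.abs(v)
    order = np.argsort(av)[::-1]            # sort |v| in decreasing order
    z = av[order] - lam
    # Pool adjacent violators so the solution is nonincreasing, then clip at 0
    blocks = []                             # each block: [start, end, mean, count]
    for i, val in enumerate(z):
        blocks.append([i, i, val, 1])
        while len(blocks) > 1 and blocks[-2][2] <= blocks[-1][2]:
            s1, _, m1, c1 = blocks[-2]
            _, e2, m2, c2 = blocks.pop()
            blocks[-1] = [s1, e2, (m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]
    x = np.zeros_like(z)
    for s, e, m, _ in blocks:
        x[s:e + 1] = max(m, 0.0)
    out = np.empty_like(x)
    out[order] = x                          # undo the sort, restore signs
    return sign * out

print(prox_sorted_l1(np.array([3.0, -1.0, 2.5]), np.array([1.0, 0.5, 0.25])))
```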
Abstract: Risk estimation is at the core of many learning systems. The importance of this problem has motivated researchers to propose different schemes, such as cross-validation, generalized cross-validation, and the bootstrap. The theoretical properties of such estimates have been extensively studied in low-dimensional settings, where the number of predictors $p$ is much smaller than the number of observations $n$. However, a unifying methodology accompanied by a rigorous theory is lacking in high-dimensional settings. This paper studies the problem of risk estimation under the high-dimensional asymptotic setting $n, p \rightarrow \infty$ and $n/p \rightarrow \delta$ (for a fixed number $\delta$), and proves the consistency of three risk estimates that have been successful in numerical studies: leave-one-out cross-validation (LOOCV), approximate leave-one-out (ALO), and approximate message passing (AMP)-based techniques. A cornerstone of our analysis is a bound that we obtain on the discrepancy between the `residuals' obtained from AMP and LOOCV. This connection not only enables us to obtain more refined information on the estimates of AMP, ALO, and LOOCV, but also offers an upper bound on the convergence rate of each estimate.
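For context, a minimal sketch of AMP for the lasso with a fixed soft threshold; the Onsager-corrected residuals $z_t$ below are the AMP `residuals' whose discrepancy from the LOOCV residuals the analysis bounds. Problem sizes and the threshold are illustrative choices.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def amp_lasso(X, y, theta, iters=50):
    """AMP for the lasso with a fixed soft threshold theta."""
    n, p = X.shape
    b = np.zeros(p)
    z = y.copy()
    for _ in range(iters):
        b = soft(b + X.T @ z, theta)
        # Onsager correction: this term distinguishes the AMP residuals z
        # from the naive residuals y - X b
        z = y - X @ b + z * (np.count_nonzero(b) / n)
    return b, z

rng = np.random.default_rng(0)
n, p = 500, 1000
X = rng.standard_normal((n, p)) / np.sqrt(n)   # standard AMP normalization
beta = np.zeros(p); beta[:50] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)
b, z = amp_lasso(X, y, theta=0.5)
print(np.count_nonzero(b), np.mean(z ** 2))    # sparsity and residual energy
```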
Abstract: Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to converge to stationary points of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence of similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.
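A minimal sketch of the comparison described above for a symmetric two-component mixture $\frac{1}{2}N(\mu^*, I) + \frac{1}{2}N(-\mu^*, I)$: the only difference between the two EM variants is whether the mixing weight is re-estimated in the M-step. The data, the initialization, and the parameterization with two free means are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star = np.array([2.0, 0.0])
z = rng.integers(0, 2, 2000)
X = np.where(z[:, None] == 1, mu_star, -mu_star) + rng.standard_normal((2000, 2))

def em(X, mu1, mu2, w=0.5, learn_weights=False, iters=200):
    for _ in range(iters):
        # E-step: responsibilities under unit-variance isotropic Gaussians
        d1 = w * np.exp(-0.5 * ((X - mu1) ** 2).sum(1))
        d2 = (1 - w) * np.exp(-0.5 * ((X - mu2) ** 2).sum(1))
        r = d1 / (d1 + d2)
        # M-step: update means (and, if over-parameterized, the weight)
        mu1 = (r[:, None] * X).sum(0) / r.sum()
        mu2 = ((1 - r)[:, None] * X).sum(0) / (1 - r).sum()
        if learn_weights:
            w = r.mean()
    return mu1, mu2

bad_init = (np.array([5.0, 5.0]), np.array([4.0, 4.0]))  # both on one side
# Compare how close each variant gets to the true means +/- mu_star
print("fixed weights:  ", em(X, *bad_init, learn_weights=False))
print("learned weights:", em(X, *bad_init, learn_weights=True))
```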