Abstract:Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We address two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second, that of its accuracy when it exists. These properties depend both on the dimension of the covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and the excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second, to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
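As an illustration of the separation condition mentioned above, the following minimal Python sketch checks whether a dataset is strictly linearly separated by solving a small linear program; when such a separating direction exists, the logistic likelihood increases along it without bound, so the MLE does not exist. The function name, the use of scipy, and the synthetic Gaussian data are assumptions made for the example, not part of the paper.

    # Sketch: check strict linear separation of a binary dataset via a linear
    # program; if a separating direction exists, the logistic MLE does not exist.
    # Labels are assumed to lie in {-1, +1}.
    import numpy as np
    from scipy.optimize import linprog

    def strictly_separated(X, y):
        """True if some w satisfies y_i <x_i, w> >= 1 for all i."""
        n, d = X.shape
        A_ub = -y[:, None] * X              # y_i x_i^T w >= 1  <=>  -y_i x_i^T w <= -1
        res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=-np.ones(n),
                      bounds=[(None, None)] * d, method="highs")
        return res.status == 0              # feasible <=> strictly separated

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))       # illustrative Gaussian covariates
    y = np.sign(X @ np.ones(5) + rng.standard_normal(200))
    print(strictly_separated(X, y))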
Abstract:In the problem of aggregation, the aim is to combine a given class of base predictors to achieve predictions nearly as accurate as the best one. In this flexible framework, no assumption is made on the structure of the class or the nature of the target. Aggregation has been studied in both sequential and statistical contexts. Despite some important differences between the two problems, the classical results in both cases feature the same global complexity measure. In this paper, we revisit and tighten classical results in the theory of aggregation in the statistical setting by replacing the global complexity with a smaller, local one. Some of our proofs build on the PAC-Bayes localization technique introduced by Catoni. Among other results, we prove localized versions of the classical bound for the exponential weights estimator due to Leung and Barron and deviation-optimal bounds for the Q-aggregation estimator. These bounds improve over the results of Dai, Rigollet and Zhang for fixed design regression and the results of Lecu\'e and Rigollet for random design regression.
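For concreteness, the sketch below implements a basic exponential-weights aggregate over a finite class of base predictors under square loss. The temperature eta, the uniform prior and the synthetic data are illustrative choices; they do not reproduce the precise tunings (Leung-Barron, Q-aggregation) analyzed in the paper.

    # Sketch: exponential-weights aggregation of M base predictors under square
    # loss. The temperature eta and the uniform prior are illustrative choices.
    import numpy as np

    def exp_weights_aggregate(preds, y, eta):
        """preds: (M, n) array of predictions of M base predictors on n points."""
        losses = np.sum((preds - y) ** 2, axis=1)   # empirical square losses
        logw = -eta * losses
        logw -= logw.max()                          # for numerical stability
        w = np.exp(logw)
        w /= w.sum()                                # posterior weights (uniform prior)
        return w @ preds                            # convex combination of predictors

    rng = np.random.default_rng(1)
    y = rng.standard_normal(50)
    preds = y + 0.5 * rng.standard_normal((10, 50))  # 10 noisy base predictors
    aggregate = exp_weights_aggregate(preds, y, eta=0.1)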
Abstract:We study sequential probability assignment in the Gaussian setting, where the goal is to predict, or equivalently compress, a sequence of real-valued observations almost as well as the best Gaussian distribution with mean constrained to a given subset of $\mathbf{R}^n$. First, in the case of a convex constraint set $K$, we express the hardness of the prediction problem (the minimax regret) in terms of the intrinsic volumes of $K$; specifically, it equals the logarithm of the Wills functional from convex geometry. We then establish a comparison inequality for the Wills functional in the general nonconvex case, which underlines the metric nature of this quantity and generalizes the Slepian-Sudakov-Fernique comparison principle for the Gaussian width. Motivated by this inequality, we characterize the exact order of magnitude of the considered functional for a general nonconvex set, in terms of global covering numbers and local Gaussian widths. This implies metric isomorphic estimates for the log-Laplace transform of the intrinsic volume sequence of a convex body. As part of our analysis, we also characterize the minimax redundancy for a general constraint set. We finally relate and contrast our findings with classical asymptotic results in information theory.
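For reference, the Wills functional of a convex body $K \subset \mathbf{R}^n$ can be written as the sum of its intrinsic volumes $V_0(K), \dots, V_n(K)$, or equivalently via Hadwiger's integral representation; the normalization of $K$ entering the regret identity above may differ by a rescaling.
\[
\mathcal{W}(K) \;=\; \sum_{j=0}^{n} V_j(K) \;=\; \int_{\mathbf{R}^n} e^{-\pi\,\mathrm{dist}(x,K)^2}\,\mathrm{d}x .
\]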
Abstract:In this short note, we present an elementary analysis of the prediction error of ridge regression with random design. The proof is short and self-contained. In particular, it avoids matrix concentration inequalities and the control of empirical processes, relying instead on a simple combination of exchangeability arguments, matrix identities and operator convexity.
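A minimal sketch of the estimator under study follows, assuming a synthetic Gaussian design and an illustrative regularization level; the function name and the held-out evaluation of the excess prediction risk are choices made for the example.

    # Sketch: ridge regression with random design and its excess prediction
    # risk on fresh covariates. The regularization level lam is illustrative.
    import numpy as np

    def ridge(X, y, lam):
        n, d = X.shape
        return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    rng = np.random.default_rng(2)
    n, d = 500, 30
    theta = rng.standard_normal(d) / np.sqrt(d)
    X, X_test = rng.standard_normal((n, d)), rng.standard_normal((10 * n, d))
    y = X @ theta + rng.standard_normal(n)
    theta_hat = ridge(X, y, lam=0.1)
    excess_risk = np.mean((X_test @ (theta_hat - theta)) ** 2)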
Abstract:We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. When learning without assumptions on the covariates, we establish boundedness of the conditional second moment of the response variable as a necessary and sufficient condition for achieving a deviation-optimal rate of convergence for the excess risk. In particular, combining the ideas of truncated least squares, median-of-means procedures and aggregation theory, we construct a non-linear estimator achieving an excess risk of order $d/n$ with the optimal sub-exponential tail. While the existing approaches to learning linear classes under heavy-tailed distributions focus on proper estimators, we highlight that the improperness of our estimator is necessary for attaining non-trivial guarantees in the distribution-free setting considered in this work. Finally, as a byproduct of our analysis, we prove an optimal version of the classical bound for the truncated least squares estimator due to Gy\"{o}rfi, Kohler, Krzyzak, and Walk.
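The sketch below only illustrates two of the ingredients mentioned above, namely truncation of a least squares fit at a level B and a median-of-means mean estimate; the estimator constructed in the paper combines such ingredients with an aggregation step and is substantially more involved. Function names and the truncation level are assumptions for the example.

    # Sketch of two ingredients only: truncation of a least squares fit at
    # level B, and a median-of-means estimate of a mean. The estimator of the
    # paper combines such ingredients with an aggregation step.
    import numpy as np

    def truncated_ls_predict(X, y, X_test, B):
        theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares fit
        return np.clip(X_test @ theta_hat, -B, B)          # truncate predictions

    def median_of_means(z, k, seed=0):
        """Split z into k blocks and return the median of the block means."""
        rng = np.random.default_rng(seed)
        blocks = np.array_split(rng.permutation(z), k)
        return np.median([block.mean() for block in blocks])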
Abstract:We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data-dependent subspaces spanned by a random subset of the data. This approach naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have recently been explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we extend these results to convex Lipschitz loss functions, which might not be smooth, such as the hinge loss used in support vector machines. Our main results show the existence of different regimes, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance. Theoretical results are complemented with numerical experiments on large-scale benchmark datasets.
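A possible minimal instance of the scheme described above is sketched below, assuming a linear kernel and scikit-learn's LinearSVC as the hinge-loss solver; penalizing the subspace coefficients rather than the predictor itself, as well as the specific regularization parameter, are simplifications made for the example.

    # Sketch: hinge-loss ERM restricted to the span of a random subset of the
    # data (a Nystrom-type subspace, here with the linear kernel k(x, x') = <x, x'>).
    import numpy as np
    from sklearn.svm import LinearSVC

    def random_subspace_svm(X, y, m, C=1.0, seed=0):
        rng = np.random.default_rng(seed)
        S = rng.choice(len(X), size=m, replace=False)   # random subset of the data
        Z = X @ X[S].T                                  # features k(x, x_i), i in S
        clf = LinearSVC(C=C, loss="hinge", dual=True).fit(Z, y)
        return lambda X_new: clf.predict(X_new @ X[S].T)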
Abstract:We analyze the prediction performance of ridge and ridgeless regression when both the sample size and the dimension of the data go to infinity. In particular, we consider a general setting introducing prior assumptions characterizing "easy" and "hard" learning problems. In this setting, we show that ridgeless (zero regularisation) regression is optimal for easy problems with a high signal-to-noise ratio. Furthermore, we show that additional descents in the ridgeless bias and variance learning curve can occur beyond the interpolating threshold, verifying recent empirical observations. More generally, we show how a variety of learning curves are possible depending on the problem at hand. From a technical point of view, characterising the influence of prior assumptions requires extending previous applications of random matrix theory to study ridge regression.
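A minimal sketch of the ridgeless estimator discussed above, i.e. the minimum-norm least squares solution obtained as the zero-regularisation limit of ridge regression; the overparametrized synthetic setting is illustrative.

    # Sketch: the ridgeless (minimum-norm) least squares estimator, the
    # zero-regularisation limit of ridge regression; it interpolates when d >= n.
    import numpy as np

    def ridgeless(X, y):
        return np.linalg.pinv(X) @ y       # minimum-norm least squares solution

    rng = np.random.default_rng(3)
    n, d = 200, 400                         # illustrative overparametrized regime
    theta = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ theta + 0.1 * rng.standard_normal(n)
    theta_hat = ridgeless(X, y)
    print(np.allclose(X @ theta_hat, y))    # interpolation of the training data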
Abstract:We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This predictor minimizes a new general excess risk bound, which critically remains valid under model misspecification. On standard examples, this bound scales as $d/n$ where $d$ is the dimension of the model and $n$ the sample size, regardless of the true distribution. The SMP, which is an improper (out-of-model) procedure, improves over proper (within-model) estimators (such as the maximum likelihood estimator), whose excess risk can degrade arbitrarily in the misspecified case. For density estimation, our bounds improve over approaches based on online-to-batch conversion, by removing suboptimal $\log n$ factors, addressing an open problem from Gr{\"u}nwald and Kot{\l}owski (2011) for the considered models. For the Gaussian linear model, the SMP admits an explicit expression, and its expected excess risk in the general misspecified case is at most twice the minimax excess risk in the \emph{well-specified case}, but without any condition on the noise variance or approximation error of the linear model. For logistic regression, a penalized SMP can be computed efficiently by training two logistic regressions, and achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ is a bound on the norm of the features and $B$ the norm of the comparison linear predictor. This improves the rates of proper (within-model) estimators, since such procedures can achieve no better rate than $\min(BR/\sqrt{n},de^{BR}/n)$ in general. This also provides a computationally more efficient alternative to approaches based on online-to-batch conversion of Bayesian mixture procedures, which require approximate posterior sampling, thereby partly answering a question by Foster et al. (2018).
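The sketch below illustrates the virtual-sample construction behind the penalized SMP for logistic regression: for a query point, refit a penalized logistic regression with the query labeled $+1$ and labeled $-1$, then renormalize the two resulting conditional probabilities. Scikit-learn's default ridge penalty stands in for the penalization used in the paper, and the function name is an assumption for the example.

    # Sketch of the virtual-sample idea behind the penalized SMP for logistic
    # regression: refit with the query x labeled +1 and labeled -1, then
    # renormalize. The ridge penalty of sklearn's LogisticRegression stands in
    # for the penalization used in the paper. Labels are assumed in {-1, +1}.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def smp_probability(X, y, x, C=1.0):
        """SMP-style probability that the label of the query x is +1."""
        prob = {}
        for label in (+1, -1):
            X_aug = np.vstack([X, x])                  # augment with the virtual sample
            y_aug = np.append(y, label)
            clf = LogisticRegression(C=C).fit(X_aug, y_aug)
            k = list(clf.classes_).index(label)
            prob[label] = clf.predict_proba(x.reshape(1, -1))[0, k]
        return prob[+1] / (prob[+1] + prob[-1])        # renormalize over the two labels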
Abstract:The first part of this paper is devoted to the decision-theoretic analysis of random-design linear prediction with square loss. It is known that, under boundedness constraints on the response (and thus on the regression coefficients), the minimax excess risk scales as $\sigma^2d/n$ up to constants, where $d$ is the model dimension, $n$ the sample size, and $\sigma^2$ the noise parameter. Here, we study the expected excess risk with respect to the full linear class. We show that the ordinary least squares (OLS) estimator is minimax optimal in the well-specified case, for every distribution of covariates and noise level. Further, we express the minimax risk in terms of the distribution of statistical leverage scores of individual samples. We deduce a precise minimax lower bound of $\sigma^2d/(n-d+1)$, valid for any distribution of covariates, which nearly matches the risk of OLS for Gaussian covariates. We then obtain nonasymptotic upper bounds on the minimax risk for covariates that satisfy a "small ball"-type regularity condition, which scale as $(1+o(1))\sigma^2d/n$ when $d=o(n)$, both in the well-specified and misspecified cases. Our main technical contribution is the study of the lower tail of the smallest singular value of empirical covariance matrices around $0$. We establish a general lower bound on this lower tail, together with a matching upper bound under a necessary regularity condition. Our proof relies on the PAC-Bayesian technique for controlling empirical processes, and extends an analysis of Oliveira (2016) devoted to a different part of the lower tail. Equivalently, our upper bound shows that the operator norm of the inverse sample covariance matrix has bounded $L^q$ norm up to $q\asymp n$, and this exponent is unimprovable. Finally, we show that the regularity condition on the design naturally holds for independent coordinates.
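As a small illustration of the quantity entering the minimax risk expression above, the following sketch computes the statistical leverage scores of a sample, i.e. the diagonal entries of the hat matrix; the QR-based computation and the synthetic design are illustrative choices.

    # Sketch: statistical leverage scores of a sample, i.e. the diagonal of the
    # hat matrix X (X^T X)^{-1} X^T, computed through a thin QR factorization.
    import numpy as np

    def leverage_scores(X):
        Q, _ = np.linalg.qr(X)             # hat matrix equals Q Q^T
        return np.sum(Q ** 2, axis=1)      # diagonal entries, each in [0, 1]

    rng = np.random.default_rng(4)
    X = rng.standard_normal((100, 5))
    print(leverage_scores(X).sum())        # sums to d = 5 (the rank of X)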
Abstract:Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to feature scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants, however, are "offline" algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees; in particular, this makes it possible to obtain a truly online, parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.
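The sketch below illustrates the core idea of exact aggregation over all prunings via a Context Tree Weighting-style recursion: each node mixes, with prior weight one half, the option of stopping (acting as a leaf) with the option of splitting into its two subtrees. The Node class and the leaf losses are hypothetical simplifications introduced for the example and do not reproduce the AMF implementation.

    # Sketch: exact aggregation over all prunings of a binary tree via a
    # Context Tree Weighting-style recursion. Each node mixes, with prior
    # weight 1/2, "stop here" (use the node as a leaf) and "split" (use the
    # two subtrees). Node and the leaf losses are hypothetical placeholders.
    import math

    class Node:
        def __init__(self, leaf_loss, left=None, right=None):
            self.leaf_loss = leaf_loss     # cumulative loss if used as a leaf
            self.left, self.right = left, right

    def log_mixture_weight(node):
        """log of the CTW-style mixture weight of the subtree rooted at node."""
        if node.left is None:                       # a leaf of the full tree
            return -node.leaf_loss
        stop = math.log(0.5) - node.leaf_loss
        split = (math.log(0.5) + log_mixture_weight(node.left)
                 + log_mixture_weight(node.right))
        m = max(stop, split)                        # log-sum-exp of the two options
        return m + math.log(math.exp(stop - m) + math.exp(split - m))

    tree = Node(3.0, Node(1.0), Node(1.5))
    print(log_mixture_weight(tree))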