University of Bayreuth
Abstract: In this paper, some methods for combining the empirical bootstrap approach with stochastic gradient descent (SGD) to minimize the empirical risk over a separable Hilbert space are investigated from the viewpoint of algorithmic stability and statistical robustness. The first two types of approaches are based on averages and are investigated from a theoretical point of view. A generalization analysis for bootstrap SGD of Type 1 and Type 2 based on algorithmic stability is carried out. Another type of bootstrap SGD is proposed to demonstrate that it is possible to construct purely distribution-free pointwise confidence intervals of the median curve using bootstrap SGD.
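The following is a minimal sketch of the general idea behind bootstrap SGD, not the specific Type 1/2/3 constructions analyzed above: SGD for kernel-based least-squares regression is rerun on bootstrap resamples of the data, and pointwise quantiles of the resulting prediction curves give a heuristic band around their pointwise median. The kernel, step size, and all parameter choices below are illustrative assumptions.

```python
# Sketch only: bootstrap resampling wrapped around SGD for kernel least-squares regression.
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * (x - y) ** 2)

def sgd_kernel_regression(X, Y, gamma=1.0, lam=0.01, step=0.5, epochs=5, rng=None):
    """Plain SGD on the regularized empirical least-squares risk, parametrized by
    coefficients alpha of the kernel expansion over the training points."""
    n = len(X)
    alpha = np.zeros(n)
    rng = np.random.default_rng() if rng is None else rng
    K = gaussian_kernel(X[:, None], X[None, :], gamma)      # n x n Gram matrix
    for _ in range(epochs):
        for i in rng.permutation(n):
            residual = K[i] @ alpha - Y[i]                   # f(x_i) - y_i
            grad = residual * K[i] + lam * (K @ alpha)       # gradient w.r.t. alpha
            alpha -= step / n * grad
    return alpha

def bootstrap_sgd_band(X, Y, grid, B=200, level=0.9, seed=0, **kw):
    """Run SGD on B bootstrap resamples and return the pointwise median curve
    together with heuristic pointwise lower/upper quantile curves."""
    rng = np.random.default_rng(seed)
    n = len(X)
    curves = np.empty((B, len(grid)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                     # resample with replacement
        alpha = sgd_kernel_regression(X[idx], Y[idx], rng=rng, **kw)
        Kg = gaussian_kernel(grid[:, None], X[idx][None, :], kw.get("gamma", 1.0))
        curves[b] = Kg @ alpha
    lo, hi = np.quantile(curves, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return np.median(curves, axis=0), lo, hi

# toy usage
X = np.linspace(0, 1, 80); Y = np.sin(2 * np.pi * X) + 0.3 * np.random.randn(80)
grid = np.linspace(0, 1, 50)
median_curve, lower, upper = bootstrap_sgd_band(X, Y, grid)
```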
Abstract: In this paper, we study an online learning algorithm with a robust loss function $\mathcal{L}_{\sigma}$ for regression over a reproducing kernel Hilbert space (RKHS). The loss function $\mathcal{L}_{\sigma}$, involving a scaling parameter $\sigma>0$, can cover a wide range of commonly used robust losses. The proposed algorithm is then a robust alternative to online least squares regression aiming to estimate the conditional mean function. For properly chosen $\sigma$ and step size, we show that the last iterate of this online algorithm can achieve optimal capacity-independent convergence in the mean square distance. Moreover, if additional information on the underlying function space is known, we also establish optimal capacity-dependent rates for strong convergence in the RKHS. To the best of our knowledge, both results are new to the existing literature on online learning.
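As a rough illustration (an assumption-laden sketch, not the exact algorithm or loss analyzed above), the following implements online kernel gradient descent with one robust loss of the kind such a family can cover, the Welsch/correntropy-type loss $\mathcal{L}_{\sigma}(r)=\tfrac{\sigma^2}{2}\,(1-e^{-r^2/\sigma^2})$, whose derivative $r\,e^{-r^2/\sigma^2}$ downweights large residuals. The polynomially decaying step size and all parameters are likewise illustrative choices.

```python
# Sketch only: online kernel regression with a Welsch/correntropy-type robust loss.
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2, axis=-1))

class OnlineRobustKernelRegression:
    def __init__(self, sigma=1.0, gamma=1.0, step0=1.0, power=0.5):
        self.sigma, self.gamma = sigma, gamma
        self.step0, self.power = step0, power      # step size eta_t = step0 * t^(-power)
        self.centers, self.coefs = [], []          # f_t = sum_j coefs[j] * K(centers[j], .)

    def predict(self, x):
        if not self.centers:
            return 0.0
        k = gaussian_kernel(np.asarray(self.centers), np.asarray(x), self.gamma)
        return float(k @ np.asarray(self.coefs))

    def update(self, x, y, t):
        r = self.predict(x) - y                                # residual f_t(x_t) - y_t
        grad_r = r * np.exp(-r ** 2 / self.sigma ** 2)         # derivative of L_sigma; small for outliers
        eta = self.step0 * (t ** -self.power)
        # stochastic gradient step in the RKHS: f_{t+1} = f_t - eta * grad_r * K(x_t, .)
        self.centers.append(np.asarray(x, dtype=float))
        self.coefs.append(-eta * grad_r)

# toy usage: one pass over a stream containing a few gross outliers
rng = np.random.default_rng(0)
model = OnlineRobustKernelRegression(sigma=0.5, gamma=10.0)
for t in range(1, 501):
    x = rng.uniform(0, 1, size=1)
    y = np.sin(2 * np.pi * x[0]) + (5.0 if rng.random() < 0.05 else 0.1 * rng.normal())
    model.update(x, y, t)
```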
Abstract: Regularized kernel-based methods such as support vector machines (SVMs) typically depend on the underlying probability measure $\mathrm{P}$ (respectively, an empirical measure $\mathrm{D}_n$ in applications) as well as on the regularization parameter $\lambda$ and the kernel $k$. Whereas classical statistical robustness only considers the effect of small perturbations in $\mathrm{P}$, the present paper investigates the influence of simultaneous slight variations in the whole triple $(\mathrm{P},\lambda,k)$, respectively $(\mathrm{D}_n,\lambda_n,k)$, on the resulting predictor. Existing results from the literature are considerably generalized and improved. In order to also make them applicable to big data, where regular SVMs suffer from their super-linear computational requirements, we show how our results can be transferred to the context of localized learning. Here, the effect of slight variations in the applied regionalization, which might for example stem from changes in $\mathrm{P}$ or $\mathrm{D}_n$, is considered as well.
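To make the notion of localized learning concrete, here is a hypothetical sketch in which the regionalization is given by k-means cells and one SVM is fitted per region; the clustering-based regionalization, the scikit-learn estimators, and all parameters are illustrative assumptions, not the construction studied above.

```python
# Sketch only: localized SVM regression with a k-means regionalization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

class LocalizedSVR:
    def __init__(self, n_regions=5, C=1.0, gamma="scale", seed=0):
        self.regionizer = KMeans(n_clusters=n_regions, random_state=seed, n_init=10)
        self.C, self.gamma = C, gamma
        self.models = {}

    def fit(self, X, y):
        labels = self.regionizer.fit_predict(X)          # assign each point to a region
        for r in np.unique(labels):
            mask = labels == r
            self.models[r] = SVR(kernel="rbf", C=self.C, gamma=self.gamma).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        labels = self.regionizer.predict(X)              # route each point to its local SVM
        out = np.empty(len(X))
        for r, model in self.models.items():
            mask = labels == r
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out

# toy usage
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
pred = LocalizedSVR(n_regions=4).fit(X, y).predict(X)
```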
Abstract: It is shown that many results on the statistical robustness of kernel-based pairwise learning can be derived under basically no assumptions on the input and output spaces. In particular, neither moment conditions on the conditional distribution of $Y$ given $X=x$ nor boundedness of the output space is needed. We obtain results on the existence and boundedness of the influence function and show qualitative robustness of the kernel-based estimator. The present paper generalizes results by Christmann and Zhou (2016) by allowing the prediction function to take two arguments and can thus be applied in a variety of situations such as ranking.
Abstract: Regularized empirical risk minimization using kernels and their corresponding reproducing kernel Hilbert spaces (RKHSs) plays an important role in machine learning. However, the actually used kernel often depends on one or on a few hyperparameters, or the kernel is even data dependent in a much more complicated manner. Examples are Gaussian RBF kernels, kernel learning, and hierarchical Gaussian kernels, which were recently proposed for deep learning. Therefore, the actually used kernel is often computed by a grid search or in an iterative manner and can often only be considered as an approximation to the "ideal" or "optimal" kernel. The paper gives conditions under which classical kernel-based methods based on a convex Lipschitz loss function and on a bounded and smooth kernel are stable if the probability measure $P$, the regularization parameter $\lambda$, and the kernel $k$ may slightly change in a simultaneous manner. Similar results are also given for pairwise learning. Therefore, the topic of this paper is somewhat more general than in classical robust statistics, where usually only the influence of small perturbations of the probability measure $P$ on the estimated function is considered.
Abstract: Statistical machine learning plays an important role in modern statistics and computer science. One main goal of statistical machine learning is to provide universally consistent algorithms, i.e., algorithms whose estimator converges in probability or in some stronger sense to the Bayes risk or to the Bayes decision function. Kernel methods based on minimizing the regularized risk over a reproducing kernel Hilbert space (RKHS) belong to these statistical machine learning methods. It is in general unknown which kernel yields optimal results for a particular data set or for the unknown probability measure. Hence, various kernel learning methods have been proposed to choose the kernel, and therefore also its RKHS, in a data-adaptive manner. Nevertheless, many practitioners often use the classical Gaussian RBF kernel or certain Sobolev kernels with good success. The goal of this short note is to offer one possible theoretical explanation for this empirical fact.
Abstract: Regularized empirical risk minimization including support vector machines plays an important role in machine learning theory. In this paper, regularized pairwise learning (RPL) methods based on kernels are investigated. One example is regularized minimization of the error entropy loss, which has recently attracted quite some interest from the viewpoint of consistency and learning rates. This paper shows that such RPL methods additionally have good statistical robustness properties if the loss function and the kernel are chosen appropriately. We treat two cases of particular interest: (i) a bounded and non-convex loss function and (ii) an unbounded convex loss function satisfying a certain Lipschitz type condition.
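As an illustration of case (i), the following sketch evaluates a regularized pairwise risk with a bounded, non-convex loss of error-entropy type for a kernel expansion; the concrete Gaussian windowing of residual differences, the bandwidth $h$, and all parameters are assumptions made for the example, not the exact loss studied above.

```python
# Sketch only: regularized pairwise risk with a bounded, non-convex loss on residual differences.
import numpy as np

def gram(X, Z, gamma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def regularized_pairwise_risk(alpha, X, y, lam=0.1, gamma=1.0, h=0.5):
    """Empirical pairwise risk of f = sum_i alpha_i K(x_i, .) plus lambda * ||f||_H^2."""
    K = gram(X, X, gamma)
    residuals = K @ alpha - y                          # r_i = f(x_i) - y_i
    diffs = residuals[:, None] - residuals[None, :]    # pairwise residual differences
    loss = 1.0 - np.exp(-diffs ** 2 / (2 * h ** 2))    # bounded, non-convex pairwise loss
    return loss.mean() + lam * alpha @ K @ alpha

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)
alpha = np.zeros(100)
print(regularized_pairwise_risk(alpha, X, y))
```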
Abstract: Additive models play an important role in semiparametric statistics. This paper gives learning rates for regularized kernel-based methods for additive models. These learning rates compare favourably, in particular in high dimensions, to recent results on optimal learning rates for purely nonparametric regularized kernel-based quantile regression using the Gaussian radial basis function kernel, provided the assumption of an additive model is valid. Additionally, a concrete example is presented to show that a Gaussian function depending only on one variable lies in a reproducing kernel Hilbert space generated by an additive Gaussian kernel, but does not belong to the reproducing kernel Hilbert space generated by the multivariate Gaussian kernel of the same variance.
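A small sketch of the additive construction (illustrative only): an additive Gaussian kernel sums univariate Gaussian RBF kernels over the coordinates and can be plugged into kernel ridge regression via a precomputed Gram matrix, in contrast to the usual multivariate Gaussian RBF kernel. The regression model and all parameters are assumptions for the example, not those analyzed above.

```python
# Sketch only: additive Gaussian kernel vs. multivariate Gaussian RBF kernel in kernel ridge regression.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def additive_gaussian_kernel(X, Z, gamma=1.0):
    # k(x, z) = sum_j exp(-gamma * (x_j - z_j)^2)
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-gamma * diff ** 2).sum(axis=2)

def multivariate_gaussian_kernel(X, Z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 5))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)   # additive target

additive_model = KernelRidge(alpha=0.1, kernel="precomputed").fit(additive_gaussian_kernel(X, X), y)
rbf_model = KernelRidge(alpha=0.1, kernel="precomputed").fit(multivariate_gaussian_kernel(X, X), y)

X_test = rng.uniform(-1, 1, size=(50, 5))
pred_additive = additive_model.predict(additive_gaussian_kernel(X_test, X))
pred_rbf = rbf_model.predict(multivariate_gaussian_kernel(X_test, X))
```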
Abstract: It is shown that bootstrap approximations of support vector machines (SVMs) based on a general convex and smooth loss function and on a general kernel are consistent. This result is useful for approximating the unknown finite-sample distribution of SVMs by the bootstrap approach.
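The practical use suggested above can be illustrated by the following hedged sketch: refit an SVM on bootstrap resamples and use the empirical distribution of its predictions at a fixed point as an approximation to the unknown finite-sample distribution. The scikit-learn SVR and all parameter choices are illustrative assumptions.

```python
# Sketch only: bootstrap approximation of the distribution of an SVM prediction at a fixed point.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, size=(n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=n)
x0 = np.array([[0.5]])                                   # point at which the prediction is studied

boot_preds = []
for _ in range(300):
    idx = rng.integers(0, n, size=n)                     # resample with replacement
    svm = SVR(kernel="rbf", C=1.0, epsilon=0.05).fit(X[idx], y[idx])
    boot_preds.append(svm.predict(x0)[0])

# the empirical distribution of boot_preds approximates the unknown
# finite-sample distribution of the SVM prediction at x0
lower, upper = np.quantile(boot_preds, [0.025, 0.975])
```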
Abstract: It is shown that bootstrap approximations of an estimator which is based on a continuous operator from the set of Borel probability measures defined on a compact metric space into a complete separable metric space are stable in the sense of qualitative robustness. Support vector machines based on shifted loss functions are treated as special cases.