Abstract: A neural architecture with randomly initialized weights is, in the infinite-width limit, equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process (NNGP) kernel. We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a given approximation error, the required number of neurons in each layer is determined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector together with training of the last layer's weights. For a 2-layer NN and a domain equal to the $(n-1)$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer feature construction. We show that if the eigenvalues of the integral operator of the NNGP decay more slowly than $k^{-n-\frac{2}{3}}$, where $k$ is the order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also carry out computational experiments to verify our theoretical findings. They show that realistic neural networks easily learn target functions even when neither theorem gives any guarantee.
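As a loose illustration of the last-layer construction mentioned above (not the paper's precise procedure), the following sketch builds a frozen random multi-layer representation of inputs on the unit sphere and fits only the output weights by least squares; the widths, ReLU activations, Gaussian initialization, and the toy target function are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, widths=(512, 512)):
    # Frozen random multi-layer representation: only the layer after this
    # will be trained.  Gaussian weights with 1/sqrt(fan_in) scaling and
    # ReLU activations are illustrative choices.
    H = X
    for m in widths:
        W = rng.normal(size=(H.shape[1], m)) / np.sqrt(H.shape[1])
        H = np.maximum(H @ W, 0.0)
    return H

# Toy supervised data on the unit sphere S^{n-1} (hypothetical target function).
n, N = 5, 2000
X = rng.normal(size=(N, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.cos(3.0 * X[:, 0])

# Train only the last layer's weights by ridge-regularized least squares.
Phi = random_features(X)
w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(Phi.shape[1]), Phi.T @ y)
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```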
Abstract: Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by Statistical Query algorithms. Recently, this classical fact re-emerged in the theory of gradient-based optimization of neural networks. In this framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. The set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has recently attracted attention from deep learning theorists and cryptographers. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of the limitations and challenges of using gradient-based techniques to learn a high-frequency periodic function or modular multiplication from examples. We show that the variance of the gradient is negligibly small in both cases when either the frequency or the prime base $p$ is large, which prevents such a learning algorithm from being successful.
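A minimal numerical sketch of the quantity this analysis tracks: the variance, over a random choice of $a$, of the gradient at initialization of a simple model fitted to $(x,\, ax \bmod p)$ pairs. The one-hidden-layer model, the squared loss, and the normalization of the labels are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 101                                  # prime base (illustrative size)
x = np.arange(p, dtype=float)

# Frozen random features of a small one-hidden-layer network.
W = rng.normal(size=(1, 64))
Phi = np.maximum((x[:, None] / p) @ W, 0.0)
w0 = np.zeros(64)                        # output weights at initialization

def grad(a):
    # Gradient of the squared loss w.r.t. the output weights
    # for the target f_a(x) = (a * x mod p) / p.
    y = (a * np.arange(p) % p) / p
    return (2.0 / p) * Phi.T @ (Phi @ w0 - y)

G = np.stack([grad(a) for a in range(1, p)])
print("average per-coordinate variance of the gradient over a:", G.var(axis=0).mean())
```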
Abstract: The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the base of the logarithm. This concentration property severely restricts the ability to learn the parity bit efficiently with gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on the Boas-Bellman inequality in inner product spaces and involves establishing the approximate orthogonality of the discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments with a neural network-based approach further confirm these limitations, demonstrating a decreasing success rate in predicting the parity bit as the group order increases.
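For reference, the Boas-Bellman inequality invoked here is, in the form commonly cited, the following refinement of Bessel's inequality: for vectors $x, y_1,\dots,y_n$ in an inner product space,
\[
\sum_{i=1}^n |\langle x, y_i\rangle|^2 \;\le\; \|x\|^2\Bigl(\max_{1\le i\le n}\|y_i\|^2 + \Bigl(\sum_{i\neq j}|\langle y_i, y_j\rangle|^2\Bigr)^{1/2}\Bigr),
\]
so when the $y_i$ are approximately orthogonal, the sum of squared correlations with any fixed $x$ stays small; this is the mechanism through which approximate orthogonality of the parity bit functions translates into gradient concentration.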
Abstract: We formulate the manifold learning problem as the problem of finding an operator that maps any point to a close neighbor lying on a ``hidden'' $k$-dimensional manifold. We call this operator the correcting function. Under this formulation, autoencoders can be viewed as a tool to approximate the correcting function. Given an autoencoder whose Jacobian has rank $k$, we deduce from the classical Constant Rank Theorem that its range has the structure of a $k$-dimensional manifold. The $k$-dimensionality of the range can be forced by the architecture of the autoencoder (by fixing the dimension of the code space) or, alternatively, by an additional constraint that the rank of the autoencoder mapping is not greater than $k$. This constraint is included in the objective function as a new term, namely the squared Ky-Fan $k$-antinorm of the Jacobian. We claim that this constraint effectively reduces the dimension of the range of an autoencoder, in addition to the reduction imposed by the architecture. We also add a new curvature term to the objective. Finally, we experimentally compare our approach with the CAE+H method on synthetic and real-world datasets.
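The rank penalty can be illustrated numerically. In the minimal sketch below, the squared Ky-Fan $k$-antinorm of the Jacobian is read as the sum of squared singular values beyond the $k$ largest (an assumption about the definition used in the paper); it vanishes exactly when the Jacobian has rank at most $k$. The toy autoencoder and the finite-difference Jacobian are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def kyfan_antinorm_sq(J, k):
    # Sum of squared singular values of J beyond the k largest
    # (assumed reading of the squared Ky-Fan k-antinorm);
    # it equals zero iff rank(J) <= k.
    s = np.linalg.svd(J, compute_uv=False)
    return float(np.sum(s[k:] ** 2))

# Toy autoencoder r(x) = W2 @ tanh(W1 @ x) with a wide code space,
# so any rank reduction must come from the penalty, not the architecture.
W1, W2 = rng.normal(size=(16, 5)), rng.normal(size=(5, 16))
r = lambda x: W2 @ np.tanh(W1 @ x)

def jacobian(f, x, eps=1e-6):
    # Central finite differences; column i holds d f / d x_i.
    e = np.eye(x.size)
    return np.stack([(f(x + eps * e[i]) - f(x - eps * e[i])) / (2 * eps)
                     for i in range(x.size)], axis=1)

x0 = rng.normal(size=5)
print("penalty at x0:", kyfan_antinorm_sq(jacobian(r, x0), k=2))
```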
Abstract: Valued constraint satisfaction problems with ordered variables (VCSPO) are a special case of Valued CSPs in which variables are totally ordered and soft constraints are imposed on tuples of variables that do not violate the order. We study a restriction of VCSPO in which soft constraints are imposed on segments of adjacent variables and the constraint language $\Gamma$ consists of $\{0,1\}$-valued characteristic functions of predicates. This kind of potential generalizes the so-called pattern-based potentials, which have been applied in many structured prediction tasks. For a constraint language $\Gamma$ we introduce a closure operator, $ \overline{\Gamma^{\cap}}\supseteq \Gamma$, and give examples of constraint languages for which $|\overline{\Gamma^{\cap}}|$ is small. If all predicates in $\Gamma$ are Cartesian products, we show that the minimization of a generalized pattern-based potential (or the computation of its partition function) can be carried out in ${\mathcal O}(|V|\cdot |D|^2 \cdot |\overline{\Gamma^{\cap}}|^2 )$ time, where $V$ is the set of variables and $D$ is the domain. If, additionally, only non-positive weights of constraints are allowed, the complexity of the minimization task drops to ${\mathcal O}(|V|\cdot |\overline{\Gamma^{\cap}}| \cdot |D| \cdot \max_{\rho\in \Gamma}\|\rho\|^2 )$, where $\|\rho\|$ is the arity of $\rho\in \Gamma$. For a general language $\Gamma$ and non-positive weights, the minimization task can be carried out in ${\mathcal O}(|V|\cdot |\overline{\Gamma^{\cap}}|^2)$ time. We argue that in many natural cases $\overline{\Gamma^{\cap}}$ is of moderate size, though in the worst case $|\overline{\Gamma^{\cap}}|$ can blow up and depend exponentially on $\max_{\rho\in \Gamma}\|\rho\|$.
Abstract: The classical Mercer theorem states that a continuous positive definite kernel $K({\mathbf x}, {\mathbf y})$ on a compact set can be represented as $\sum_{i=1}^\infty \lambda_i\phi_i({\mathbf x})\phi_i({\mathbf y})$, where $\{(\lambda_i,\phi_i)\}$ are the eigenvalue-eigenfunction pairs of the corresponding integral operator. This infinite series is known to converge uniformly to the kernel $K$. We estimate the speed of this convergence in terms of the decay rate of the eigenvalues and demonstrate that for $3m$ times differentiable kernels the first $N$ terms of the series approximate $K$ with error $\mathcal{O}\big((\sum_{i=N+1}^\infty\lambda_i)^{\frac{m}{m+n}}\big)$ or $\mathcal{O}\big((\sum_{i=N+1}^\infty\lambda^2_i)^{\frac{m}{2m+n}}\big)$.
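The following small numerical check (not part of the paper) illustrates the quantity being bounded: on a grid discretization of a smooth kernel, it compares the sup-norm error of the truncated Mercer series with the eigenvalue tail $\sum_{i>N}\lambda_i$. The Gaussian kernel, the interval $[0,1]$, and the Nystrom-style discretization are assumptions made for illustration.

```python
import numpy as np

# Nystrom-style check of how fast the truncated Mercer series approaches K.
n = 400
x = np.linspace(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # smooth kernel (illustrative)

# Eigen-decomposition of the discretized integral operator (uniform weight 1/n).
lam, U = np.linalg.eigh(K / n)
lam, U = lam[::-1], U[:, ::-1]                      # descending eigenvalues
phi = U * np.sqrt(n)                                # approximate eigenfunctions on the grid

for N in (5, 10, 20):
    K_N = (phi[:, :N] * lam[:N]) @ phi[:, :N].T     # first N terms of the series
    print(N, "sup-norm error:", np.abs(K - K_N).max(), "eigenvalue tail:", lam[N:].sum())
```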
Abstract: We develop new upper and lower bounds on the $\varepsilon$-entropy of the unit ball in a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel $K$. Our bounds are based on the behaviour of the eigenvalues of the corresponding integral operator. In our approach we exploit the ellipsoidal structure of the unit ball in an RKHS and earlier work by Dumer, Pinsker, and Prelov on covering numbers of ellipsoids in Euclidean space. We present a number of applications of our main bound, such as its tightness for the practically important case of the Gaussian kernel. Further, we develop a series of lower bounds on the $\varepsilon$-entropy that follow from a connection between covering numbers of a ball in an RKHS and quantization of the Gaussian Random Field corresponding to the kernel $K$ via the Kosambi-Karhunen-Lo\`eve transform.
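The ellipsoidal structure referred to above is the standard description of the RKHS unit ball in the Mercer eigenbasis; as a brief reminder (a known fact, not one of the paper's bounds), with $\{(\lambda_i,\phi_i)\}$ the eigenvalue-eigenfunction pairs of the integral operator,
\[
B_K \;=\; \Bigl\{\, f=\sum_{i\ge 1} a_i\,\phi_i \;:\; \sum_{i\ge 1}\frac{a_i^2}{\lambda_i}\le 1 \,\Bigr\},
\]
i.e. an infinite-dimensional ellipsoid with semi-axes $\sqrt{\lambda_i}$ along the eigenfunctions, which is what makes covering-number results for Euclidean ellipsoids, such as those of Dumer, Pinsker, and Prelov, applicable.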
Abstract: We present a new way of studying Mercer kernels by associating with a special kernel $K$ a pseudo-differential operator $p({\mathbf x}, D)$ such that $\mathcal{F} p({\mathbf x}, D)^\dag p({\mathbf x}, D) \mathcal{F}^{-1}$ acts on smooth functions in the same way as the integral operator associated with $K$ (where $\mathcal{F}$ is the Fourier transform). We show that kernels defined by pseudo-differential operators can approximate uniformly any continuous Mercer kernel on a compact set. The symbol $p({\mathbf x}, {\mathbf y})$ encapsulates a lot of useful information about the structure of the Maximum Mean Discrepancy (MMD) distance defined by the kernel $K$. We approximate $p({\mathbf x}, {\mathbf y})$ by the sum of the first $r$ terms of the Singular Value Decomposition of $p$, denoted by $p_r({\mathbf x}, {\mathbf y})$. If the ordered singular values of the integral operator associated with $p({\mathbf x}, {\mathbf y})$ decay rapidly, the MMD distance defined by the new symbol $p_r$ differs from the initial one only slightly. Moreover, the new MMD distance can be interpreted as an aggregated result of comparing $r$ local moments of two probability distributions. The latter result holds under the condition that the right singular vectors of the integral operator associated with $p$ are uniformly bounded; but even when this condition is not satisfied, the Hilbert-Schmidt distance between $p$ and $p_r$ still vanishes. Thus, we report an interesting phenomenon: the MMD distance measures the difference between two probability distributions with respect to a certain number of local moments, $r^\ast$, and this number $r^\ast$ depends on the rate at which the singular values of $p$ decay.
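The paper truncates the operator's symbol via its SVD; as a loose numerical analogue only (not the paper's construction), the sketch below truncates the Gram matrix spectrally and shows the resulting biased MMD$^2$ estimate stabilizing once the retained rank $r$ covers the rapidly decaying spectrum. The Gaussian kernel, its bandwidth, and the sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                 # sample from P (illustrative)
Y = rng.normal(size=(300, 2)) + 0.5           # sample from Q (illustrative)

def gram(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Kxx, Kyy, Kxy):
    # Biased (V-statistic) estimate of squared MMD.
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

Z = np.vstack([X, Y])
K = gram(Z, Z)
lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]

n = len(X)
print("full-kernel MMD^2:", mmd2(K[:n, :n], K[n:, n:], K[:n, n:]))
for r in (2, 5, 20):
    Kr = (U[:, :r] * lam[:r]) @ U[:, :r].T    # rank-r spectral truncation of the Gram matrix
    print(f"rank-{r} MMD^2:", mmd2(Kr[:n, :n], Kr[n:, n:], Kr[:n, n:]))
```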
Abstract: We show that the skip-gram embedding of any word can be decomposed into two subvectors that roughly correspond to the semantic and syntactic roles of the word.
Abstract: The classical dimension reduction problem can be loosely formulated as the problem of finding a $k$-dimensional affine subspace of ${\mathbb R}^n$ onto which the data points ${\mathbf x}_1,\cdots, {\mathbf x}_N$ can be projected without loss of valuable information. We reformulate this problem in the language of tempered distributions, i.e. as the problem of approximating an empirical probability density function $p_{\rm{emp}}({\mathbf x}) = \frac{1}{N} \sum_{i=1}^N \delta^n ({\mathbf x} - {\mathbf x}_i)$, where $\delta^n$ is the $n$-dimensional Dirac delta function, by another tempered distribution $q({\mathbf x})$ whose density is supported in some $k$-dimensional subspace. Thus, our problem is reduced to the minimization of a certain loss function $I(q)$ measuring the distance from $q$ to $p_{\rm{emp}}$ over a pertinent set of generalized functions, denoted $\mathcal{G}_k$. Another classical problem of data analysis is the sufficient dimension reduction problem. We show that it can be reduced to the following problem: given a function $f: {\mathbb R}^n\rightarrow {\mathbb R}$ and a probability density function $p({\mathbf x})$, find a function of the form $g({\mathbf w}^T_1{\mathbf x}, \cdots, {\mathbf w}^T_k{\mathbf x})$ that minimizes the loss ${\mathbb E}_{{\mathbf x}\sim p} |f({\mathbf x})-g({\mathbf w}^T_1{\mathbf x}, \cdots, {\mathbf w}^T_k{\mathbf x})|^2$. We first show that the search spaces of these two problems are in a one-to-one correspondence defined by the Fourier transform. We then introduce a nonnegative penalty function $R(f)$ and a set of ordinary functions $\Omega_\epsilon = \{f\mid R(f)\leq \epsilon\}$ such that $\Omega_\epsilon$ `approximates' the space $\mathcal{G}_k$ as $\epsilon \rightarrow 0$. Finally, we present an algorithm for the minimization of $I(f)+\lambda R(f)$ based on a two-step iterative computation.
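To make the sufficient dimension reduction objective concrete, here is a small numerical sketch (not the paper's two-step algorithm for $I(f)+\lambda R(f)$) that evaluates ${\mathbb E}|f({\mathbf x})-g({\mathbf w}^T{\mathbf x})|^2$ for $k=1$, fitting $g$ as the best polynomial of the projection; the target $f$, the Gaussian sampling density, and the polynomial degree are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 6, 5000
X = rng.normal(size=(N, n))                      # samples from p (standard Gaussian assumed)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
w_true /= np.linalg.norm(w_true)
f = np.sin(X @ w_true)                           # target depends on one hidden direction

def sdr_loss(w, degree=7):
    # Empirical E|f(x) - g(w^T x)|^2 with g fitted as the best polynomial
    # of the one-dimensional projection w^T x.
    t = X @ (w / np.linalg.norm(w))
    g = np.poly1d(np.polyfit(t, f, degree))
    return np.mean((f - g(t)) ** 2)

print("loss at the true direction:  ", sdr_loss(w_true))
print("loss at a random direction:  ", sdr_loss(rng.normal(size=n)))
```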