Abstract:$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations have recently been proposed for solving the $K$-means optimization problem that enjoy strong statistical optimality guarantees, but the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but without a solid statistical underpinning nor rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors compared to the existing state-of-the-art.
Abstract:We study the problem of estimating a low-rank matrix from noisy measurements, with the specific goal of achieving minimax optimal error. In practice, the problem is commonly solved using non-convex gradient descent, due to its ability to scale to large-scale real-world datasets. In theory, non-convex gradient descent is capable of achieving minimax error. But in practice, it often converges extremely slowly, such that it cannot even deliver estimations of modest accuracy within reasonable time. On the other hand, methods that improve the convergence of non-convex gradient descent, through rescaling or preconditioning, also greatly amplify the measurement noise, resulting in estimations that are orders of magnitude less accurate than what is theoretically achievable with minimax optimal error. In this paper, we propose a slight modification to the usual non-convex gradient descent method that remedies the issue of slow convergence, while provably preserving its minimax optimality. Our proposed algorithm has essentially the same per-iteration cost as non-convex gradient descent, but is guaranteed to converge to minimax error at a linear rate that is immune to ill-conditioning. Using our proposed algorithm, we reconstruct a 60 megapixel dataset for a medical imaging application, and observe significantly decreased reconstruction error compared to previous approaches.
Abstract:To rigorously certify the robustness of neural networks to adversarial perturbations, most state-of-the-art techniques rely on a triangle-shaped linear programming (LP) relaxation of the ReLU activation. While the LP relaxation is exact for a single neuron, recent results suggest that it faces an inherent "convex relaxation barrier" as additional activations are added, and as the attack budget is increased. In this paper, we propose a nonconvex relaxation for the ReLU relaxation, based on a low-rank restriction of a semidefinite programming (SDP) relaxation. We show that the nonconvex relaxation has a similar complexity to the LP relaxation, but enjoys improved tightness that is comparable to the much more expensive SDP relaxation. Despite nonconvexity, we prove that the verification problem satisfies constraint qualification, and therefore a Riemannian staircase approach is guaranteed to compute a near-globally optimal solution in polynomial time. Our experiments provide evidence that our nonconvex relaxation almost completely overcome the "convex relaxation barrier" faced by the LP relaxation.
Abstract:This paper focuses on complete dictionary learning problem, where the goal is to reparametrize a set of given signals as linear combinations of atoms from a learned dictionary. There are two main challenges faced by theoretical and practical studies of dictionary learning: the lack of theoretical guarantees for practically-used heuristic algorithms, and their poor scalability when dealing with huge-scale datasets. Towards addressing these issues, we show that when the dictionary to be learned is orthogonal, that an alternating minimization method directly applied to the nonconvex and discrete formulation of the problem exactly recovers the ground truth. For the huge-scale, potentially online setting, we propose a minibatch version of our algorithm, which can provably learn a complete dictionary from a huge-scale dataset with minimal sample complexity, linear sparsity level, and linear convergence rate, thereby negating the need for any convex relaxation for the problem. Our numerical experiments showcase the superiority of our method compared with the existing techniques when applied to tasks on real data.
Abstract:The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with $d$ so large that even the simplest full-dimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slow-down when the underlying ground truth is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$-close to ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our numerical experiments, we observe a similar acceleration for ill-conditioned matrix completion under the 1-bit cross-entropy loss, as well as pairwise losses such as the Bayesian Personalized Ranking (BPR) loss.
Abstract:We consider minimizing a twice-differentiable, $L$-smooth, and $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=\phi(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a constant factor with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which is known to be sharp if $\phi$ is allowed to be nonsmooth and/or non-strongly convex, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $\phi$ is almost perfectly conditioned, with a condition number of $L/\mu<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.
Abstract:We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
Abstract:We prove that it is possible for nonconvex low-rank matrix recovery to contain no spurious local minima when the rank of the unknown ground truth $r^{\star}<r$ is strictly less than the search rank $r$, and yet for the claim to be false when $r^{\star}=r$. Under the restricted isometry property (RIP), we prove, for the general overparameterized regime with $r^{\star}\le r$, that an RIP constant of $\delta<1/(1+\sqrt{r^{\star}/r})$ is sufficient for the inexistence of spurious local minima, and that $\delta<1/(1+1/\sqrt{r-r^{\star}+1})$ is necessary due to existence of counterexamples. Without an explicit control over $r^{\star}\le r$, an RIP constant of $\delta<1/2$ is both necessary and sufficient for the exact recovery of a rank-$r$ ground truth. But if the ground truth is known a priori to have $r^{\star}=1$, then the sharp RIP threshold for exact recovery is improved to $\delta<1/(1+1/\sqrt{r})$.
Abstract:The robustness of a neural network to adversarial examples can be provably certified by solving a convex relaxation. If the relaxation is loose, however, then the resulting certificate can be too conservative to be practically useful. Recently, a less conservative robustness certificate was proposed, based on a semidefinite programming (SDP) relaxation of the ReLU activation function. In this paper, we give a geometric analysis for the tightness of this relaxation. We show that, for a least-squares restriction of the usual adversarial attack problem, the SDP relaxation is tight over a single hidden layer under reasonable assumptions. The resulting robustness certificate is exact, meaning that it provides a lower-bound on the size of the smallest adversarial perturbation, as well as a globally optimal perturbation that attains the lower-bound. For several hidden layers, the SDP relaxation is not usually tight; we give an explanation using the underlying hyperbolic geometry. We experimentally confirm our theoretical insights using a general-purpose interior-point method and a custom rank-2 Burer-Monteiro algorithm.
Abstract:Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant $\delta$. If $\delta$ is too large, however, then counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique that is capable of establishing sharp thresholds on $\delta$ to guarantee the inexistence of spurious local minima. Using the technique, we prove that in the case of a rank-1 ground truth, an RIP constant of $\delta<1/2$ is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point $x_{0}$ satisfying $f(x_{0})\le(1-\delta)^{2}f(0)$, any descent algorithm that converges to second-order optimality guarantees exact recovery.