Abstract:We consider the injectivity property of ReLU network layers. Determining the ReLU injectivity capacity (the ratio of the numbers of a layer's inputs and outputs) is shown to be isomorphic to determining the capacity of the so-called $\ell_0$ spherical perceptron. Employing \emph{fully lifted random duality theory} (fl RDT), a powerful program is developed and utilized to handle the $\ell_0$ spherical perceptron and, implicitly, the injectivity of ReLU layers. To put the entire fl RDT machinery to practical use, a sizeable set of numerical evaluations is conducted as well. The lifting mechanism is observed to converge remarkably fast, with relative corrections in the estimated quantities not exceeding $\sim 0.1\%$ already on the third level of lifting. Closed form explicit analytical relations among key lifting parameters are uncovered as well. In addition to being of great importance in handling all the required numerical work, these relations also shed new light on the parametric interconnections within the lifting structure. Finally, the obtained results are also shown to fairly closely match the replica predictions from [40].
Abstract:We study the theoretical limits of $\ell_0$ (quasi) norm based optimization algorithms when employed for solving classical compressed sensing or sparse regression problems. Considering standard contexts with deterministic signals and statistical systems, we utilize \emph{fully lifted random duality theory} (Fl RDT) and develop a generic analytical program for studying the performance of \emph{maximum-likelihood} (ML) decoding. The key ML performance parameter, the residual \emph{root mean square error} ($\textbf{RMSE}$), is uncovered to exhibit the so-called \emph{phase-transition} (PT) phenomenon. The associated aPT curve, which separates the regions of system dimensions where \emph{an} $\ell_0$ based algorithm succeeds or fails in achieving a small (comparable to the noise) ML optimal $\textbf{RMSE}$, is precisely determined as well. In parallel, we uncover the existence of another dPT curve which provides the same separation but for practically feasible \emph{descending} $\ell_0$ ($d\ell_0$) algorithms. Concrete implementation and practical relevance of the Fl RDT typically rely on the ability to conduct a sizeable set of the underlying numerical evaluations. These reveal that, for ML decoding, the Fl RDT converges remarkably fast, with corrections in the estimated quantities not exceeding $\sim 0.1\%$ already on the third level of lifting. Analytical results are supplemented by a sizeable set of numerical experiments where we implement a simple variant of $d\ell_0$ and demonstrate that its practical performance very accurately matches the theoretical predictions. Surprisingly, a remarkably precise agreement between the simulations and the theory is observed for fairly small dimensions of the order of 100.
Abstract:We consider fully row/column-correlated linear regression models and study several classical estimators (including minimum norm interpolators (GLS), ordinary least squares (LS), and ridge regressors). We show that \emph{Random Duality Theory} (RDT) can be utilized to obtain precise closed form characterizations of all optimizing quantities of interest associated with these estimators, including the \emph{prediction risk} (testing or generalization error). On a qualitative level, our results recover the risk's well known non-monotonic (so-called double-descent) behavior as the number of features/sample size ratio increases. On a quantitative level, our closed form results show how the risk explicitly depends on all key model parameters, including the problem dimensions and covariance matrices. Moreover, a special case of our results, obtained when intra-sample (or time-series) correlations are not present, precisely matches the corresponding ones obtained via spectral methods in [6,16,17,24].
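The double-descent behavior mentioned in this abstract can be illustrated in the simplest uncorrelated (isotropic) setting. The sketch below is a rough illustration only: the formula is the standard asymptotic risk of the minimum norm interpolator for isotropic features known from the spectral-methods literature, not the correlated-model characterization of the abstract, and the names `min_norm_risk`, `gamma`, `signal_power`, and `noise_var` are hypothetical.

```python
import math


def min_norm_risk(gamma, signal_power=1.0, noise_var=1.0):
    """Asymptotic prediction risk of the minimum norm interpolator.

    Assumption: isotropic (uncorrelated) features, with gamma the ratio
    (number of features) / (sample size). The irreducible noise term is
    excluded. This is an illustrative special case, not the general
    correlated result.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if gamma == 1.0:
        # The risk diverges at the interpolation threshold.
        return math.inf
    if gamma < 1.0:
        # Under-parametrized regime: pure variance term.
        return noise_var * gamma / (1.0 - gamma)
    # Over-parametrized regime: bias term plus variance term.
    return signal_power * (1.0 - 1.0 / gamma) + noise_var / (gamma - 1.0)


if __name__ == "__main__":
    for gamma in (0.2, 0.5, 0.9, 0.99, 1.01, 1.5, 2.0, 5.0, 10.0):
        print(f"gamma = {gamma:5.2f}   risk = {min_norm_risk(gamma):10.4f}")
```

Sweeping `gamma` through 1 shows the risk rising sharply near the interpolation threshold and descending again in the over-parametrized regime, which is the non-monotonic shape the abstract refers to.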
Abstract:We consider correlated \emph{factor} regression models (FRM) and analyze the performance of classical ridge interpolators. Utilizing the powerful \emph{Random Duality Theory} (RDT) mathematical engine, we obtain \emph{precise} closed form characterizations of the underlying optimization problems and all associated optimizing quantities. In particular, we provide \emph{excess prediction risk} characterizations that clearly show the dependence on all key model parameters, covariance matrices, loadings, and dimensions. As a function of the over-parametrization ratio, the generalized least squares (GLS) risk also exhibits the well known \emph{double-descent} (non-monotonic) behavior. Similarly to the classical linear regression models (LRM), we demonstrate that this FRM phenomenon can be smoothed out by optimally tuned ridge regularization. The theoretical results are supplemented by numerical simulations, and an excellent agreement between the two is observed. Moreover, we note that ``ridge smoothing'' is often of limited effect already for over-parametrization ratios above $5$ and of virtually no effect for those above $10$. This solidifies the notion that one of the recently most popular neural networks paradigms -- \emph{zero-training (interpolating) generalizes well} -- enjoys wider applicability, including the one within the FRM estimation/prediction context.
Abstract:We consider \emph{random linear programs} (rlps) as a subclass of \emph{random optimization problems} (rops) and study their typical behavior. Our particular focus is on appropriate linear objectives which connect the rlps to the mean widths of random polyhedrons/polytopes. Utilizing the powerful machinery of \emph{random duality theory} (RDT) \cite{StojnicRegRndDlt10}, we obtain, in a large dimensional context, the exact characterizations of the program's objectives. In particular, for any $\alpha=\lim_{n\rightarrow\infty}\frac{m}{n}\in(0,\infty)$, any unit vector $\mathbf{c}\in{\mathbb R}^n$, any fixed $\mathbf{a}\in{\mathbb R}^n$, and $A\in {\mathbb R}^{m\times n}$ with iid standard normal entries, we have \begin{eqnarray*} \lim_{n\rightarrow\infty}{\mathbb P}_{A} \left ( (1-\epsilon) \xi_{opt}(\alpha;\mathbf{a}) \leq \min_{A\mathbf{x}\leq \mathbf{a}}\mathbf{c}^T\mathbf{x} \leq (1+\epsilon) \xi_{opt}(\alpha;\mathbf{a}) \right ) \longrightarrow 1, \end{eqnarray*} where \begin{equation*} \xi_{opt}(\alpha;\mathbf{a}) \triangleq \min_{x>0} \sqrt{x^2- x^2 \lim_{n\rightarrow\infty} \frac{\sum_{i=1}^{m} \left ( \frac{1}{2} \left (\left ( \frac{\mathbf{a}_i}{x}\right )^2 + 1\right ) \mbox{erfc}\left( \frac{\mathbf{a}_i}{x\sqrt{2}}\right ) - \frac{\mathbf{a}_i}{x} \frac{e^{-\frac{\mathbf{a}_i^2}{2x^2}}}{\sqrt{2\pi}} \right ) }{n} }. \end{equation*} For example, for $\mathbf{a}=\mathbf{1}$, one uncovers \begin{equation*} \xi_{opt}(\alpha) = \min_{x>0} \sqrt{x^2- x^2 \alpha \left ( \frac{1}{2} \left ( \frac{1}{x^2} + 1\right ) \mbox{erfc} \left ( \frac{1}{x\sqrt{2}}\right ) - \frac{1}{x} \frac{e^{-\frac{1}{2x^2}}}{\sqrt{2\pi}} \right ) }. \end{equation*} Moreover, $2 \xi_{opt}(\alpha)$ is precisely the concentrating point of the mean width of the polyhedron $\{\mathbf{x}|A\mathbf{x} \leq \mathbf{1}\}$.
Abstract:In \cite{Hop82}, Hopfield introduced a \emph{Hebbian} learning rule based neural network model and suggested how it can efficiently operate as an associative memory. Studying random binary patterns, he also uncovered that, if a small fraction of errors is tolerated in the stored patterns retrieval, the capacity of the network (maximal number of memorized patterns, $m$) scales linearly with each pattern's size, $n$. Moreover, he famously predicted $\alpha_c=\lim_{n\rightarrow\infty}\frac{m}{n}\approx 0.14$. We study this very same scenario with two famous pattern's basins of attraction: \textbf{\emph{(i)}} The AGS one from \cite{AmiGutSom85}; and \textbf{\emph{(ii)}} The NLT one from \cite{Newman88,Louk94,Louk94a,Louk97,Tal98}. Relying on the \emph{fully lifted random duality theory} (fl RDT) from \cite{Stojnicflrdt23}, we obtain the following explicit capacity characterizations on the first level of lifting: \begin{equation} \alpha_c^{(AGS,1)} = \left ( \max_{\delta\in \left ( 0,\frac{1}{2}\right ) }\frac{1-2\delta}{\sqrt{2} \mbox{erfinv} \left ( 1-2\delta\right )} - \frac{2}{\sqrt{2\pi}} e^{-\left ( \mbox{erfinv}\left ( 1-2\delta \right )\right )^2}\right )^2 \approx \mathbf{0.137906} \end{equation} \begin{equation} \alpha_c^{(NLT,1)} = \frac{\mbox{erf}(x)^2}{2x^2}-1+\mbox{erf}(x)^2 \approx \mathbf{0.129490}, \quad 1-\mbox{erf}(x)^2- \frac{2\mbox{erf}(x)e^{-x^2}}{\sqrt{\pi}x}+\frac{2e^{-2x^2}}{\pi}=0. \end{equation} A substantial numerical work gives on the second level of lifting $\alpha_c^{(AGS,2)} \approx \mathbf{0.138186}$ and $\alpha_c^{(NLT,2)} \approx \mathbf{0.12979}$, effectively uncovering a remarkably fast lifting convergence. Moreover, the obtained AGS characterizations exactly match the replica symmetry based ones of \cite{AmiGutSom85} and the corresponding symmetry breaking ones of \cite{SteKuh94}.
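The two first-level capacity expressions above can be checked numerically. The following is a minimal sketch using only the Python standard library; it assumes the substitution $t=\mbox{erfinv}(1-2\delta)$ (so that $1-2\delta=\mbox{erf}(t)$ and $\delta\in(0,\frac{1}{2})$ maps to $t\in(0,\infty)$) to avoid needing an inverse error function, and the helper names, search intervals, and tolerances are illustrative choices, not part of the original work.

```python
import math


def ags_objective(t):
    """AGS first-level objective after substituting t = erfinv(1 - 2*delta)."""
    return (math.erf(t) / (math.sqrt(2.0) * t)
            - 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-t * t))


def nlt_condition(x):
    """Left-hand side of the stationarity condition that determines x (NLT)."""
    e = math.erf(x)
    return (1.0 - e * e
            - 2.0 * e * math.exp(-x * x) / (math.sqrt(math.pi) * x)
            + 2.0 * math.exp(-2.0 * x * x) / math.pi)


def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return 0.5 * (a + b)


def bisect_root(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f must change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# AGS: maximize over t (equivalently over delta), then square.
t_star = golden_section_max(ags_objective, 0.1, 5.0)
alpha_ags_1 = ags_objective(t_star) ** 2

# NLT: solve the condition for x, then evaluate the capacity expression.
x_star = bisect_root(nlt_condition, 1.0, 1.5)
alpha_nlt_1 = (math.erf(x_star) ** 2 / (2.0 * x_star ** 2)
               - 1.0 + math.erf(x_star) ** 2)

if __name__ == "__main__":
    print(f"alpha_c^(AGS,1) ~ {alpha_ags_1:.6f}")
    print(f"alpha_c^(NLT,1) ~ {alpha_nlt_1:.6f}")
```

The search intervals $[0.1, 5]$ and $[1, 1.5]$ were chosen by inspecting the two functions; they are assumptions of this sketch and would need to be revisited for any modified expression.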
Abstract:Recent progress in studying \emph{treelike committee machines} (TCM) neural networks (NN) in \cite{Stojnictcmspnncaprdt23,Stojnictcmspnncapliftedrdt23,Stojnictcmspnncapdiffactrdt23} showed that the Random Duality Theory (RDT) and its \emph{partially lifted} (pl RDT) variant are powerful tools that can be used for very precise network capacity analysis. Here, we consider \emph{wide} hidden layer networks and uncover that certain aspects of the numerical difficulties faced in \cite{Stojnictcmspnncapdiffactrdt23} miraculously disappear. In particular, we employ the recently developed \emph{fully lifted} (fl) RDT to characterize the capacity of \emph{wide} ($d\rightarrow \infty$) TCM nets. We obtain explicit, closed form, capacity characterizations for a very generic class of hidden layer activations. While the utilized approach significantly lowers the amount of needed numerical evaluations, the ultimate fl RDT usefulness and success still require a solid portion of residual numerical work. To get concrete capacity values, we take four well-known activation examples: \emph{\textbf{ReLU}}, \textbf{\emph{quadratic}}, \textbf{\emph{erf}}, and \textbf{\emph{tanh}}. After successfully conducting all the residual numerical work for all of them, we uncover that the whole lifting mechanism exhibits a remarkably rapid convergence, with the relative improvements no better than $\sim 0.1\%$ happening already on the third level of lifting. As a convenient bonus, we also uncover that the capacity characterizations obtained on the first and second level of lifting precisely match those obtained through the statistical physics replica theory methods in \cite{ZavPeh21} for the generic and in \cite{BalMalZech19} for the ReLU activations.
Abstract:We consider the capacity of \emph{treelike committee machines} (TCM) neural networks. Relying on Random Duality Theory (RDT), \cite{Stojnictcmspnncaprdt23} recently introduced a generic framework for their capacity analysis. An upgrade based on the so-called \emph{partially lifted} RDT (pl RDT) was then presented in \cite{Stojnictcmspnncapliftedrdt23}. Both lines of work focused on networks with the most typical, \emph{sign}, activations. Here, on the other hand, we focus on networks with other, more general, types of activations and show that the frameworks of \cite{Stojnictcmspnncaprdt23,Stojnictcmspnncapliftedrdt23} are sufficiently powerful to enable handling of such scenarios as well. In addition to the standard \emph{linear} activations, we uncover that particularly convenient results can be obtained for two very commonly used activations, namely, the \emph{quadratic} and \emph{rectified linear unit (ReLU)} ones. In more concrete terms, for each of these activations, we obtain both the RDT and pl RDT based memory capacity upper bound characterizations for \emph{any} given (even) number of the hidden layer neurons, $d$. In the process, we also uncover the following two, rather remarkable, facts: 1) contrary to the common wisdom, both sets of results show that the bounding capacity decreases for large $d$ (the width of the hidden layer) while converging to a constant value; and 2) the maximum bounding capacity is achieved for the networks with precisely \textbf{\emph{two}} hidden layer neurons! Moreover, the large $d$ converging values are observed to be in excellent agreement with the statistical physics replica theory based predictions.
Abstract:We consider the classical \emph{spherical} perceptrons and study their capacities. The famous zero-threshold case was solved in the 1960s (see, \cite{Wendel62,Winder,Cover65}) through high-dimensional combinatorial considerations. The general threshold, $\kappa$, case turned out to be much harder and stayed out of reach for the following several decades. Substantial progress was then made in \cite{SchTir02} and \cite{StojnicGardGen13} where the \emph{positive} threshold ($\kappa\geq 0$) scenario was finally fully settled. While the negative counterpart ($\kappa\leq 0$) remained out of reach, \cite{StojnicGardGen13} did show that the random duality theory (RDT) is still powerful enough to provide excellent upper bounds. Moreover, in \cite{StojnicGardSphNeg13}, a \emph{partially lifted} RDT variant was considered and it was shown that the upper bounds of \cite{StojnicGardGen13} can be lowered. After recent breakthroughs in studying bilinearly indexed (bli) random processes in \cite{Stojnicsflgscompyx23,Stojnicnflgscompyx23}, \emph{fully lifted} random duality theory (fl RDT) was developed in \cite{Stojnicflrdt23}. We here first show that the \emph{negative spherical perceptrons} can be fitted into the frame of the fl RDT and then employ the whole fl RDT machinery to characterize the capacity. To be fully practically operational, the fl RDT requires substantial numerical work. We, however, uncover remarkable closed form analytical relations among key lifting parameters. This discovery enables performing the needed numerical calculations to obtain concrete capacity values. We also observe that an excellent convergence (with the relative improvement $\sim 0.1\%$) is achieved already on the third (second non-trivial) level of the \emph{stationarized} full lifting.
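The zero-threshold combinatorial result referred to above can be made concrete with the classical counting formula for the number of dichotomies of $m$ points in general position in ${\mathbb R}^n$ that a homogeneous perceptron can realize, $C(m,n)=2\sum_{k=0}^{n-1}\binom{m-1}{k}$. The sketch below evaluates it and, as an additional assumption not stated in the abstract, also evaluates the classical Gardner-type expression $\alpha_c(\kappa)=1/{\mathbb E}\left[(z+\kappa)_+^2\right]$, $z\sim\mathcal{N}(0,1)$, commonly associated with the positive threshold case. Function names are hypothetical.

```python
from math import comb, erf, exp, pi, sqrt


def cover_count(m, n):
    """Number of linearly separable dichotomies of m points in general
    position in R^n (zero threshold), via the classical counting formula."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


def separable_fraction(m, n):
    """Fraction of all 2^m dichotomies that are linearly separable."""
    return cover_count(m, n) / 2 ** m


def gardner_capacity(kappa):
    """Assumed positive-threshold capacity 1 / E[(z + kappa)_+^2], kappa >= 0.

    Uses the closed form E[(z + kappa)_+^2]
        = (1 + kappa^2) * Phi(kappa) + kappa * phi(kappa),
    with Phi and phi the standard normal cdf and pdf.
    """
    if kappa < 0:
        raise ValueError("this expression is only used here for kappa >= 0")
    cdf = 0.5 * (1.0 + erf(kappa / sqrt(2.0)))
    pdf = exp(-kappa * kappa / 2.0) / sqrt(2.0 * pi)
    return 1.0 / ((1.0 + kappa * kappa) * cdf + kappa * pdf)


if __name__ == "__main__":
    n = 100
    for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):
        m = int(ratio * n)
        print(f"m/n = {ratio:3.1f}   separable fraction = "
              f"{separable_fraction(m, n):.6f}")
    for kappa in (0.0, 0.5, 1.0):
        print(f"kappa = {kappa:3.1f}   capacity = {gardner_capacity(kappa):.6f}")
```

At $m=2n$ exactly half of all dichotomies are separable, and the fraction switches sharply from near one to near zero around that point as $n$ grows, which is the zero-threshold capacity $\alpha_c=2$; the second function returns the same value at $\kappa=0$.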
Abstract:We study the capacity of \emph{sign} perceptrons neural networks (SPNN) and particularly focus on 1-hidden layer \emph{treelike committee machine} (TCM) architectures. Similarly to what happens in the case of a single perceptron neuron, it turns out that, in a statistical sense, the capacity of a corresponding multilayered network architecture consisting of multiple \emph{sign} perceptrons also undergoes the so-called phase transition (PT) phenomenon. This means: (i) for a certain range of system parameters (size of data, number of neurons), the network can be properly trained to accurately memorize \emph{all} elements of the input dataset; and (ii) outside that region such a training does not exist. Clearly, determining the corresponding phase transition curve that separates these regions is an extraordinary task and among the most fundamental questions related to the performance of any network. Utilizing a powerful mathematical engine called Random Duality Theory (RDT), we establish a generic framework for determining the upper bounds on the 1-hidden layer TCM SPNN capacity. Moreover, we do so for \emph{any} given (odd) number of neurons. We further show that the obtained results \emph{exactly} match the replica symmetry predictions of \cite{EKTVZ92,BHS92}, thereby proving that the statistical physics based results are not only nice estimates but also mathematically rigorous bounds. Moreover, for $d\leq 5$, we obtain capacity values that improve on the best known rigorous ones of \cite{MitchDurb89}, thereby establishing the first mathematically rigorous progress in well over 30 years.