Abstract:We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
Abstract:In this paper, we introduce a new finite expression method (FEX) to solve high-dimensional partial integro-differential equations (PIDEs). This approach builds upon the original FEX and its inherent advantages with new advances: 1) A novel method of parameter grouping is proposed to reduce the number of coefficients in high-dimensional function approximation; 2) A Taylor series approximation method is implemented to significantly improve the computational efficiency and accuracy of the evaluation of the integral terms of PIDEs. The new FEX based method, denoted FEX-PG to indicate the addition of the parameter grouping (PG) step to the algorithm, provides both high accuracy and interpretable numerical solutions, with the outcome being an explicit equation that facilitates intuitive understanding of the underlying solution structures. These features are often absent in traditional methods, such as finite element methods (FEM) and finite difference methods, as well as in deep learning-based approaches. To benchmark our method against recent advances, we apply the new FEX-PG to solve benchmark PIDEs in the literature. In high-dimensional settings, FEX-PG exhibits strong and robust performance, achieving relative errors on the order of single precision machine epsilon.
Abstract:Learning to Optimize (L2O) approaches, including algorithm unrolling, plug-and-play methods, and hyperparameter learning, have garnered significant attention and have been successfully applied to the Alternating Direction Method of Multipliers (ADMM) and its variants. However, the natural extension of L2O to multi-block ADMM-type methods remains largely unexplored. Such an extension is critical, as multi-block methods leverage the separable structure of optimization problems, offering substantial reductions in per-iteration complexity. Given that classical multi-block ADMM does not guarantee convergence, the Majorized Proximal Augmented Lagrangian Method (MPALM), which shares a similar form with multi-block ADMM and ensures convergence, is more suitable in this setting. Despite its theoretical advantages, MPALM's performance is highly sensitive to the choice of penalty parameters. To address this limitation, we propose a novel L2O approach that adaptively selects this hyperparameter using supervised learning. We demonstrate the versatility and effectiveness of our method by applying it to the Lasso problem and the optimal transport problem. Our numerical results show that the proposed framework outperforms popular alternatives. Given its applicability to generic linearly constrained composite optimization problems, this work opens the door to a wide range of potential real-world applications.
Abstract:Recently, the authors of Shen Yang Zhang (JMLR, 2022) developed a neural network with width $36d(2d + 1)$ and depth $11$, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in $C([a,b]^d)$. That is, the constructed network only requires a fixed number of neurons to approximate a $d$-variate continuous function on a $d$-dimensional hypercube with arbitrary accuracy. Their network uses $\mathcal{O}(d^2)$ fixed neurons. One natural question to address is whether we can reduce the number of these neurons in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, our analysis shows that there is a neural network generated by the elementary universal activation function with only $366d +365$ fixed, intrinsic (non-repeated) neurons that attains this super approximation property. Furthermore, we present a family of continuous functions that requires at least width $d$, and therefore at least $d$ intrinsic neurons, to achieve arbitrary accuracy in its approximation. This shows that the requirement of $\mathcal{O}(d)$ intrinsic neurons is optimal in the sense that it grows linearly with the input dimension $d$, unlike some approximation methods where parameters may grow exponentially with $d$.
Abstract:A vertex exchange method is proposed for solving the strongly convex quadratic program subject to the generalized simplex constraint. We conduct rigorous convergence analysis for the proposed algorithm and demonstrate its essential roles in solving some important classes of constrained convex optimization. To get a feasible initial point to execute the algorithm, we also present and analyze a highly efficient semismooth Newton method for computing the projection onto the generalized simplex. The excellent practical performance of the proposed algorithms is demonstrated by a set of extensive numerical experiments. Our theoretical and numerical results further motivate the potential applications of the considered model and the proposed algorithms.
Abstract:Human-designed algorithms have long been fundamental in solving a variety of scientific and engineering challenges. Recently, data-driven deep learning methods have also risen to prominence, offering innovative solutions across numerous scientific fields. While traditional algorithms excel in capturing the core aspects of specific problems, they often lack the flexibility needed for varying problem conditions due to the absence of specific data. Conversely, while data-driven approaches utilize vast datasets, they frequently fall short in domain-specific knowledge. To bridge these gaps, we introduce \textbf{FMint} (Foundation Model based on Initialization), a generative pre-trained model that synergizes the precision of human-designed algorithms with the adaptability of data-driven methods. This model is specifically engineered for high-accuracy simulation of dynamical systems. Starting from initial trajectories provided by conventional methods, FMint quickly delivers highly accurate solutions. It incorporates in-context learning and has been pre-trained on a diverse corpus of 500,000 dynamical systems, showcasing exceptional generalization across a broad spectrum of real-world applications. By effectively combining algorithmic rigor with data-driven flexibility, FMint sets the stage for the next generation of scientific foundation models, tackling complex problems with both efficiency and high accuracy.
Abstract:We consider a regularized expected reward optimization problem in the non-oblivious setting that covers many existing problems in reinforcement learning (RL). In order to solve such an optimization problem, we apply and analyze the classical stochastic proximal gradient method. In particular, the method has shown to admit an $O(\epsilon^{-4})$ sample complexity to an $\epsilon$-stationary point, under standard conditions. Since the variance of the classical stochastic gradient estimator is typically large which slows down the convergence, we also apply an efficient stochastic variance-reduce proximal gradient method with an importance sampling based ProbAbilistic Gradient Estimator (PAGE). To the best of our knowledge, the application of this method represents a novel approach in addressing the general regularized reward optimization problem. Our analysis shows that the sample complexity can be improved from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ under additional conditions. Our results on the stochastic (variance-reduced) proximal gradient method match the sample complexity of their most competitive counterparts under similar settings in the RL literature.
Abstract:Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with $\mathcal{C}$-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for $\mathcal{C}$-mixing sequences and the neural network approximation theory for the H\"{o}lder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design.
Abstract:Transition path theory (TPT) is a mathematical framework for quantifying rare transition events between a pair of selected metastable states $A$ and $B$. Central to TPT is the committor function, which describes the probability to hit the metastable state $B$ prior to $A$ from any given starting point of the phase space. Once the committor is computed, the transition channels and the transition rate can be readily found. The committor is the solution to the backward Kolmogorov equation with appropriate boundary conditions. However, solving it is a challenging task in high dimensions due to the need to mesh a whole region of the ambient space. In this work, we explore the finite expression method (FEX, Liang and Yang (2022)) as a tool for computing the committor. FEX approximates the committor by an algebraic expression involving a fixed finite number of nonlinear functions and binary arithmetic operations. The optimal nonlinear functions, the binary operations, and the numerical coefficients in the expression template are found via reinforcement learning. The FEX-based committor solver is tested on several high-dimensional benchmark problems. It gives comparable or better results than neural network-based solvers. Most importantly, FEX is capable of correctly identifying the algebraic structure of the solution which allows one to reduce the committor problem to a low-dimensional one and find the committor with any desired accuracy.
Abstract:Graph Signal Filter used as dimensionality reduction in spectral clustering usually requires expensive eigenvalue estimation. We analyze the filter in an optimization setting and propose to use four orthogonalization-free methods by optimizing objective functions as dimensionality reduction in spectral clustering. The proposed methods do not utilize any orthogonalization, which is known as not well scalable in a parallel computing environment. Our methods theoretically construct adequate feature space, which is, at most, a weighted alteration to the eigenspace of a normalized Laplacian matrix. We numerically hypothesize that the proposed methods are equivalent in clustering quality to the ideal Graph Signal Filter, which exploits the exact eigenvalue needed without expensive eigenvalue estimation. Numerical results show that the proposed methods outperform Power Iteration-based methods and Graph Signal Filter in clustering quality and computation cost. Unlike Power Iteration-based methods and Graph Signal Filter which require random signal input, our methods are able to utilize available initialization in the streaming graph scenarios. Additionally, numerical results show that our methods outperform ARPACK and are faster than LOBPCG in the streaming graph scenarios. We also present numerical results showing the scalability of our methods in multithreading and multiprocessing implementations to facilitate parallel spectral clustering.