Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ganzhao Yuan

AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Jun 17, 2025

Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

Abstract:Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.

Via

Access Paper or Ask Questions

Adaptive Accelerated Proximal Gradient Methods with Variance Reduction for Composite Nonconvex Finite-Sum Minimization

Feb 28, 2025

Ganzhao Yuan

Abstract:This paper proposes {\sf AAPG-SPIDER}, an Adaptive Accelerated Proximal Gradient (AAPG) method with variance reduction for minimizing composite nonconvex finite-sum functions. It integrates three acceleration techniques: adaptive stepsizes, Nesterov's extrapolation, and the recursive stochastic path-integrated estimator SPIDER. While targeting stochastic finite-sum problems, {\sf AAPG-SPIDER} simplifies to {\sf AAPG} in the full-batch, non-stochastic setting, which is also of independent interest. To our knowledge, {\sf AAPG-SPIDER} and {\sf AAPG} are the first learning-rate-free methods to achieve optimal iteration complexity for this class of \textit{composite} minimization problems. Specifically, {\sf AAPG} achieves the optimal iteration complexity of $\mathcal{O}(N \epsilon^{-2})$, while {\sf AAPG-SPIDER} achieves $\mathcal{O}(N + \sqrt{N} \epsilon^{-2})$ for finding $\epsilon$-approximate stationary points, where $N$ is the number of component functions. Under the Kurdyka-Lojasiewicz (KL) assumption, we establish non-ergodic convergence rates for both methods. Preliminary experiments on sparse phase retrieval and linear eigenvalue problems demonstrate the superior performance of {\sf AAPG-SPIDER} and {\sf AAPG} compared to existing methods.

Via

Access Paper or Ask Questions

AlphaAdam:Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates

Jan 30, 2025

Da Chang, Yu Li, Ganzhao Yuan

Abstract:In the training of large language models (LLMs), updating parameters more efficiently and stably has always been an important challenge. To achieve efficient parameter updates, existing methods usually achieve performance comparable to full parameter updates through methods such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLM from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask strength strategy to ensure efficient optimization and theoretical convergence guarantees, which is also applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in terms of convergence speed and computational efficiency across tasks, including GPT-2 pre-trained and fine-tuned RoBERTa and Llama-7B. Our AlphaAdam implements an optimizer enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available in this \href{https://github.com/MaeChd/AlphaAdam}{link}

Via

Access Paper or Ask Questions

ADMM for Structured Fractional Minimization

Nov 12, 2024

Ganzhao Yuan

Figure 1 for ADMM for Structured Fractional Minimization

Figure 2 for ADMM for Structured Fractional Minimization

Figure 3 for ADMM for Structured Fractional Minimization

Figure 4 for ADMM for Structured Fractional Minimization

Abstract:We consider a class of structured fractional minimization problems, where the numerator includes a differentiable function, a simple nonconvex nonsmooth function, a concave nonsmooth function, and a convex nonsmooth function composed with a linear operator, while the denominator is a continuous function that is either weakly convex or has a weakly convex square root. These problems are widespread and span numerous essential applications in machine learning and data science. Existing methods are mainly based on subgradient methods and smoothing proximal gradient methods, which may suffer from slow convergence and numerical stability issues. In this paper, we introduce {\sf FADMM}, the first Alternating Direction Method of Multipliers tailored for this class of problems. {\sf FADMM} decouples the original problem into linearized proximal subproblems, featuring two variants: one using Dinkelbach's parametric method ({\sf FADMM-D}) and the other using the quadratic transform method ({\sf FADMM-Q}). By introducing a novel Lyapunov function, we establish that {\sf FADMM} converges to $\epsilon$-approximate critical points of the problem within an oracle complexity of $\mathcal{O}(1/\epsilon^{3})$. Our experiments on synthetic and real-world data for sparse Fisher discriminant analysis, robust Sharpe ratio minimization, and robust sparse recovery demonstrate the effectiveness of our approach. Keywords: Fractional Minimization, Nonconvex Optimization, Proximal Linearized ADMM, Nonsmooth Optimization, Convergence Analysis

Via

Access Paper or Ask Questions

A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints

Apr 07, 2023

Ganzhao Yuan

Abstract:Nonsmooth composite optimization with orthogonality constraints has a broad spectrum of applications in statistical learning and data science. However, this problem is generally challenging to solve due to its non-convex and non-smooth nature. Existing solutions are limited by one or more of the following restrictions: (i) they are full gradient methods that require high computational costs in each iteration; (ii) they are not capable of solving general nonsmooth composite problems; (iii) they are infeasible methods and can only achieve the feasibility of the solution at the limit point; (iv) they lack rigorous convergence guarantees; (v) they only obtain weak optimality of critical points. In this paper, we propose \textit{\textbf{OBCD}}, a new Block Coordinate Descent method for solving general nonsmooth composite problems under Orthogonality constraints. \textit{\textbf{OBCD}} is a feasible method with low computation complexity footprints. In each iteration, our algorithm updates $k$ rows of the solution matrix ($k\geq2$ is a parameter) to preserve the constraints. Then, it solves a small-sized nonsmooth composite optimization problem under orthogonality constraints either exactly or approximately. We demonstrate that any exact block-$k$ stationary point is always an approximate block-$k$ stationary point, which is equivalent to the critical stationary point. We are particularly interested in the case where $k=2$ as the resulting subproblem reduces to a one-dimensional nonconvex problem. We propose a breakpoint searching method and a fifth-order iterative method to solve this problem efficiently and effectively. We also propose two novel greedy strategies to find a good working set to further accelerate the convergence of \textit{\textbf{OBCD}}. Finally, we have conducted extensive experiments on several tasks to demonstrate the superiority of our approach.

Via

Access Paper or Ask Questions

Coordinate Descent Methods for Fractional Minimization

Jan 30, 2022

Ganzhao Yuan

Figure 1 for Coordinate Descent Methods for Fractional Minimization

Figure 2 for Coordinate Descent Methods for Fractional Minimization

Figure 3 for Coordinate Descent Methods for Fractional Minimization

Figure 4 for Coordinate Descent Methods for Fractional Minimization

Abstract:We consider a class of structured fractional minimization problems, in which the numerator part of the objective is the sum of a differentiable convex function and a convex nonsmooth function, while the denominator part is a concave or convex function. This problem is difficult to solve since it is nonconvex. By exploiting the structure of the problem, we propose two Coordinate Descent (CD) methods for solving this problem. One is applied to the original fractional function, the other is based on the associated parametric problem. The proposed methods iteratively solve a one-dimensional subproblem \textit{globally}, and they are guaranteed to converge to coordinate-wise stationary points. In the case of a convex denominator, we prove that the proposed CD methods using sequential nonconvex approximation find stronger stationary points than existing methods. Under suitable conditions, CD methods with an appropriate initialization converge linearly to the optimal point (also the coordinate-wise stationary point). In the case of a concave denominator, we show that the resulting problem is quasi-convex, and any critical point is a global minimum. We prove that the algorithms converge to the global optimal solution with a sublinear convergence rate. We demonstrate the applicability of the proposed methods to some machine learning and signal processing models. Our experiments on real-world data have shown that our method significantly and consistently outperforms existing methods in terms of accuracy.

* arXiv admin note: text overlap with arXiv:2109.04228

Via

Access Paper or Ask Questions

Coordinate Descent Methods for DC Minimization

Sep 09, 2021

Ganzhao Yuan

Figure 1 for Coordinate Descent Methods for DC Minimization

Figure 2 for Coordinate Descent Methods for DC Minimization

Figure 3 for Coordinate Descent Methods for DC Minimization

Figure 4 for Coordinate Descent Methods for DC Minimization

Abstract:Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has been found rich applications in statistical learning and studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, only leading to weak optimality of critical points. This paper proposes a coordinate descent method for minimizing DC functions based on sequential nonconvex approximation. Our approach iteratively solves a nonconvex one-dimensional subproblem globally, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the critical point condition and the directional point condition when the objective function is weakly convex. For comparisons, we also include a naive variant of coordinate descent methods based on sequential convex approximation in our study. When the objective function satisfies an additional regularity condition called \emph{sharpness}, coordinate descent methods with an appropriate initialization converge \emph{linearly} to the optimal solution set. Also, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. We present some discussions and extensions of our proposed method. Finally, we have conducted extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.

Via

Access Paper or Ask Questions

A Coordinate-wise Optimization Algorithm for Sparse Inverse Covariance Selection

Apr 04, 2018

Ganzhao Yuan, Haoxian Tan, Wei-Shi Zheng

Figure 1 for A Coordinate-wise Optimization Algorithm for Sparse Inverse Covariance Selection

Figure 2 for A Coordinate-wise Optimization Algorithm for Sparse Inverse Covariance Selection

Abstract:Sparse inverse covariance selection is a fundamental problem for analyzing dependencies in high dimensional data. However, such a problem is difficult to solve since it is NP-hard. Existing solutions are primarily based on convex approximation and iterative hard thresholding, which only lead to sub-optimal solutions. In this work, we propose a coordinate-wise optimization algorithm to solve this problem which is guaranteed to converge to a coordinate-wise minimum point. The algorithm iteratively and greedily selects one variable or swaps two variables to identify the support set, and then solves a reduced convex optimization problem over the support set to achieve the greatest descent. As a side contribution of this paper, we propose a Newton-like algorithm to solve the reduced convex sub-problem, which is proven to always converge to the optimal solution with global linear convergence rate and local quadratic convergence rate. Finally, we demonstrate the efficacy of our method on synthetic data and real-world data sets. As a result, the proposed method consistently outperforms existing solutions in terms of accuracy.

Via

Access Paper or Ask Questions

Convex Optimization for Linear Query Processing under Approximate Differential Privacy

May 16, 2016

Ganzhao Yuan, Yin Yang, Zhenjie Zhang, Zhifeng Hao

Figure 1 for Convex Optimization for Linear Query Processing under Approximate Differential Privacy

Figure 2 for Convex Optimization for Linear Query Processing under Approximate Differential Privacy

Figure 3 for Convex Optimization for Linear Query Processing under Approximate Differential Privacy

Figure 4 for Convex Optimization for Linear Query Processing under Approximate Differential Privacy

Abstract:Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals' privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate \emph{strategy}, may yield higher accuracy than computing each of them independently. However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-linear and non-convex. Hence, in the past much effort has been devoted in solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem. This paper points out that under ($\epsilon$, $\delta$)-differential privacy, the optimal solution of the above constrained optimization problem in search of a suitable strategy can be found, rather surprisingly, by solving a simple and elegant convex optimization program. Then, we propose an efficient algorithm based on Newton's method, which we prove to always converge to the optimal solution with linear global convergence rate and quadratic local convergence rate. Empirical evaluations demonstrate the accuracy and efficiency of the proposed solution.

* to appear in ACM SIGKDD 2016

Via

Access Paper or Ask Questions