Abstract:Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, though they improve on Reinforcement Learning from Human Feedback (RLHF), are inherently derived from PPO: they require a reference model, which consumes additional GPU memory, and they rely heavily on abundant preference data. Moreover, current preference optimization research mainly targets single-question scenarios with two responses, neglecting optimization over multiple responses and thus wasting available data in practice. This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function and maximizes the utilization of preference data. Comparing Point-wise, Pair-wise, and List-wise implementations, we find that the Pair-wise approach achieves the best performance, significantly enhancing the quality of model responses. Experimental results demonstrate MPPO's strong performance across various benchmarks. On MT-Bench, MPPO outperforms DPO, ORPO, and SimPO. Notably, on Arena-Hard, MPPO surpasses DPO and ORPO by substantial margins. These results underscore the clear advantages of MPPO in preference optimization tasks.
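To make the pairwise objective concrete, here is a minimal sketch of a preference loss driven by the average response likelihood, with no reference-model term; the tensor shapes, helper names, and the Bradley-Terry form are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def avg_log_likelihood(logits, labels, mask):
        # Average per-token log-probability of a response, so long and
        # short replies are scored on a comparable scale.
        logps = torch.log_softmax(logits, dim=-1)                        # (B, T, V)
        token_logps = torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)
        return (token_logps * mask).sum(-1) / mask.sum(-1)               # (B,)

    def pairwise_loss(chosen_avg_logp, rejected_avg_logp, beta=1.0):
        # Push the preferred reply's average log-likelihood above the
        # rejected one's. With k replies per question, every
        # preferred/rejected pair contributes one such term, so no
        # preference data is discarded.
        return -F.logsigmoid(beta * (chosen_avg_logp - rejected_avg_logp)).mean()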
Abstract:Adam outperforms SGD when training language models, yet this advantage is not well understood theoretically: previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$, and both already achieve the minimax-optimal rate $\widetilde{O}(T^{-1/4})$ in non-convex settings. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
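For reference, smoothness with respect to a norm bounds the change of the gradient in the dual norm, and the dual of $\ell_\infty$ is $\ell_1$; one standard way to state such an assumption (the paper's exact version may differ in detail) is \[ \|\nabla L(x) - \nabla L(y)\|_1 \le L_\infty \|x - y\|_\infty \quad \text{for all } x, y, \] in contrast to the usual $\ell_2$-smoothness $\|\nabla L(x) - \nabla L(y)\|_2 \le L_2 \|x - y\|_2$.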
Abstract:Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well understood. One challenge is that, while Adam with $\ell_2$ regularization intuitively optimizes the $\ell_2$-regularized loss, it is not clear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show that in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sums diverge, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
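In symbols, with loss $L$ and weight decay factor $\lambda$, the convergence result above says that full-batch AdamW reaches a KKT point of the constrained problem \[ \min_x \; L(x) \quad \text{subject to} \quad \|x\|_\infty \le \frac{1}{\lambda}, \] rather than the minimizer of an additively $\ell_2$-regularized objective.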
Abstract:We present WeaverBird, an intelligent dialogue system designed specifically for the finance domain. Our system harnesses a large language model based on the GPT architecture, tuned on extensive corpora of finance-related text. As a result, it can understand complex financial queries, such as "How should I manage my investments during inflation?", and provide informed responses. Furthermore, our system incorporates a local knowledge base and a search engine to retrieve relevant information; the final responses are conditioned on the search results and include proper citations to the sources, thus enjoying enhanced credibility. On a range of finance-related questions, we demonstrate that our system outperforms other models. To experience our system firsthand, users can interact with our live demo at https://weaverbird.ttic.edu, as well as watch our 2-minute video illustration at https://www.youtube.com/watch?v=fyV2qQkX6Tc.
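As an illustration of the retrieve-then-condition flow described above, here is a minimal sketch; the prompt format and the example sources are invented stand-ins, not WeaverBird's actual pipeline or data.

    # Illustrative sketch only: numbering retrieved passages lets the model
    # cite its sources as [n], as the system's responses do.
    def build_prompt(question: str, sources: list[str]) -> str:
        numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
        return (
            "Answer the finance question using the sources below, "
            "citing them as [n].\n"
            f"Sources:\n{numbered}\n"
            f"Question: {question}\nAnswer:"
        )

    sources = [
        "Inflation erodes the real return of cash holdings.",   # e.g. from the local knowledge base
        "Inflation-indexed bonds adjust principal with CPI.",   # e.g. from the search engine
    ]
    print(build_prompt("How should I manage my investments during inflation?", sources))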
Abstract:When transferring a pretrained language model, common approaches attach a task-specific classifier to the top layer and adapt all of the pretrained layers. We investigate whether one can make a task-specific selection of which subset of layers to adapt and where to place the classifier, with the goal of reducing the computational cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus: a layer is already "well-specialized" for a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute, requires no training or hyperparameter tuning, and is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
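One plausible instantiation of the metric (the paper's exact normalization may differ): for each layer, compare the within-class and between-class scatter of its hidden states on a task corpus, and prefer layers where the ratio is low.

    import numpy as np

    def variability_ratio(hidden_states: np.ndarray, labels: np.ndarray) -> float:
        # hidden_states: (n_examples, hidden_dim) from one layer;
        # labels: (n_examples,) class ids.
        # A lower ratio marks the layer as more "well-specialized".
        overall_mean = hidden_states.mean(axis=0)
        within = between = 0.0
        for c in np.unique(labels):
            group = hidden_states[labels == c]
            class_mean = group.mean(axis=0)
            within += ((group - class_mean) ** 2).sum()
            between += len(group) * ((class_mean - overall_mean) ** 2).sum()
        return within / between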