Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ron Banner

FP4 All the Way: Fully Quantized Training of LLMs

May 25, 2025

Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Abstract:We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .

Via

Access Paper or Ask Questions

EXAQ: Exponent Aware Quantization For LLMs Acceleration

Oct 04, 2024

Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy

Figure 1 for EXAQ: Exponent Aware Quantization For LLMs Acceleration

Figure 2 for EXAQ: Exponent Aware Quantization For LLMs Acceleration

Figure 3 for EXAQ: Exponent Aware Quantization For LLMs Acceleration

Figure 4 for EXAQ: Exponent Aware Quantization For LLMs Acceleration

Abstract:Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Models (LLMs) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLMs inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization, Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLMs inference. This method accelerates the calculations of both $e^x$ and $\sum(e^x)$ with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known "Physical Interaction: Question Answering" (PIQA) dataset evaluation. This ultra-low bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both $e^x$ and $\sum(e^x)$ results in a 36.9% acceleration in the softmax operation.

Via

Access Paper or Ask Questions

DropCompute: simple and more robust distributed synchronous training via compute variance reduction

Jun 18, 2023

Niv Giladi, Shahar Gottlieb, Moran Shkolnik, Asaf Karnieli, Ron Banner, Elad Hoffer, Kfir Yehuda Levy, Daniel Soudry

Abstract:Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each step. Thus, these methods are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers are straggling due to variability in compute time. We find an analytical relation between compute time properties and scalability limitations, caused by such straggling workers. With these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi Accelerators.

Via

Access Paper or Ask Questions

Optimal Fine-Grained N:M sparsity for Activations and Neural Gradients

Mar 21, 2022

Brian Chmiel, Itay Hubara, Ron Banner, Daniel Soudry

Figure 1 for Optimal Fine-Grained N:M sparsity for Activations and Neural Gradients

Figure 2 for Optimal Fine-Grained N:M sparsity for Activations and Neural Gradients

Figure 3 for Optimal Fine-Grained N:M sparsity for Activations and Neural Gradients

Figure 4 for Optimal Fine-Grained N:M sparsity for Activations and Neural Gradients

Abstract:In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) by x2, and doubles throughput by skipping computation of zero values. So far, it was only used to prune weights. We examine how this method can be used also for activations and their gradients (i.e., "neural gradients"). To this end, we first establish tensor-level optimality criteria. Previous works aimed to minimize the mean-square-error (MSE) of each pruned block. We show that while minimization of the MSE works fine for pruning the activations, it catastrophically fails for the neural gradients. Instead, we show that optimal pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases, 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when this is not the case. Further, we suggest combining several such methods together in order to speed up training even more. A reference implementation is supplied in https://github.com/brianchmiel/Act-and-Grad-structured-sparsity.

Via

Access Paper or Ask Questions

Energy awareness in low precision neural networks

Feb 06, 2022

Nurit Spingarn Eliezer, Ron Banner, Elad Hoffer, Hilla Ben-Yaakov, Tomer Michaeli

Figure 1 for Energy awareness in low precision neural networks

Figure 2 for Energy awareness in low precision neural networks

Figure 3 for Energy awareness in low precision neural networks

Figure 4 for Energy awareness in low precision neural networks

Abstract:Power consumption is a major obstacle in the deployment of deep neural networks (DNNs) on end devices. Existing approaches for reducing power consumption rely on quite general principles, including avoidance of multiplication operations and aggressive quantization of weights and activations. However, these methods do not take into account the precise power consumed by each module in the network, and are therefore not optimal. In this paper we develop accurate power consumption models for all arithmetic operations in the DNN, under various working conditions. We reveal several important factors that have been overlooked to date. Based on our analysis, we present PANN (power-aware neural network), a simple approach for approximating any full-precision network by a low-power fixed-precision variant. Our method can be applied to a pre-trained network, and can also be used during training to achieve improved performance. In contrast to previous methods, PANN incurs only a minor degradation in accuracy w.r.t. the full-precision version of the network, even when working at the power-budget of a 2-bit quantized variant. In addition, our scheme enables to seamlessly traverse the power-accuracy trade-off at deployment time, which is a major advantage over existing quantization methods that are constrained to specific bit widths.

Via

Access Paper or Ask Questions

On Recoverability of Graph Neural Network Representations

Jan 30, 2022

Maxim Fishman, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Avi Mendelson

Figure 1 for On Recoverability of Graph Neural Network Representations

Figure 2 for On Recoverability of Graph Neural Network Representations

Figure 3 for On Recoverability of Graph Neural Network Representations

Figure 4 for On Recoverability of Graph Neural Network Representations

Abstract:Despite their growing popularity, graph neural networks (GNNs) still have multiple unsolved problems, including finding more expressive aggregation methods, propagation of information to distant nodes, and training on large-scale graphs. Understanding and solving such problems require developing analytic tools and techniques. In this work, we propose the notion of recoverability, which is tightly related to information aggregation in GNNs, and based on this concept, develop the method for GNN embedding analysis. We define recoverability theoretically and propose a method for its efficient empirical estimation. We demonstrate, through extensive experimental results on various datasets and different GNN architectures, that estimated recoverability correlates with aggregation method expressivity and graph sparsification quality. Therefore, we believe that the proposed method could provide an essential tool for understanding the roots of the aforementioned problems, and potentially lead to a GNN design that overcomes them. The code to reproduce our experiments is available at https://github.com/Anonymous1252022/Recoverability

Via

Access Paper or Ask Questions

Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

Dec 19, 2021

Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, Daniel Soudry

Figure 1 for Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

Figure 2 for Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

Figure 3 for Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

Figure 4 for Logarithmic Unbiased Quantization: Practical 4-bit Training in Deep Learning

Abstract:Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. In this work, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how. Based on this, we suggest a $\textit{logarithmic unbiased quantization}$ (LUQ) method to quantize both the forward and backward phase to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.18%. We further improve this to degradation of only 0.64% after a single epoch of high precision fine-tuning combined with a variance reduction method -- both add overhead comparable to previously suggested methods. Finally, we suggest a method that uses the low precision format to avoid multiplications during two-thirds of the training process, thus reducing by 5x the area used by the multiplier.

Via

Access Paper or Ask Questions

Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Feb 16, 2021

Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Seffi Naor, Daniel Soudry

Figure 1 for Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Figure 2 for Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Figure 3 for Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Figure 4 for Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Abstract:Recently, researchers proposed pruning deep neural network weights (DNNs) using an $N:M$ fine-grained block sparsity mask. In this mask, for each block of $M$ weights, we have at least $N$ zeros. In contrast to unstructured sparsity, $N:M$ fine-grained block sparsity allows acceleration in actual modern hardware. So far, this was used for DNN acceleration at the inference phase. First, we suggest a method to convert a pretrained model with unstructured sparsity to a $N:M$ fine-grained block sparsity model, with little to no training. Then, to also allow such acceleration in the training phase, we suggest a novel transposable-fine-grained sparsity mask where the same mask can be used for both forward and backward passes. Our transposable mask ensures that both the weight matrix and its transpose follow the same sparsity pattern; thus the matrix multiplication required for passing the error backward can also be accelerated. We discuss the transposable constraint and devise a new measure for mask constraints, called mask-diversity (MD), which correlates with their expected accuracy. Then, we formulate the problem of finding the optimal transposable mask as a minimum-cost-flow problem and suggest a fast linear approximation that can be used when the masks dynamically change while training. Our experiments suggest 2x speed-up with no accuracy degradation over vision and language models. A reference implementation can be found at https://github.com/papers-submission/structured_transposable_masks.

Via

Access Paper or Ask Questions

GAN Steerability without optimization

Dec 09, 2020

Nurit Spingarn-Eliezer, Ron Banner, Tomer Michaeli

Figure 1 for GAN Steerability without optimization

Figure 2 for GAN Steerability without optimization

Figure 3 for GAN Steerability without optimization

Figure 4 for GAN Steerability without optimization

Abstract:Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations e.g., shift, zoom, color manipulations), and have similar interpretable effects across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner. However, all existing techniques rely on an optimization procedure to expose those directions, and offer no control over the degree of allowed interaction between different transformations. In this paper, we show that "steering" trajectories can be computed in closed form directly from the generator's weights without any form of training or optimization. This applies to user-prescribed geometric transformations, as well as to unsupervised discovery of more complex effects. Our approach allows determining both linear and nonlinear trajectories, and has many advantages over previous methods. In particular, we can control whether one transformation is allowed to come on the expense of another (e.g. zoom-in with or without allowing translation to keep the object centered). Moreover, we can determine the natural end-point of the trajectory, which corresponds to the largest extent to which a transformation can be applied without incurring degradation. Finally, we show how transferring attributes between images can be achieved without optimization, even across different categories.

Via

Access Paper or Ask Questions

Neural gradients are lognormally distributed: understanding sparse and quantized training

Jun 17, 2020

Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, Daniel Soudry

Figure 1 for Neural gradients are lognormally distributed: understanding sparse and quantized training

Figure 2 for Neural gradients are lognormally distributed: understanding sparse and quantized training

Figure 3 for Neural gradients are lognormally distributed: understanding sparse and quantized training

Figure 4 for Neural gradients are lognormally distributed: understanding sparse and quantized training

Abstract:Neural gradient compression remains a main bottleneck in improving training efficiency, as most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. However, these methods are not suitable for compressing neural gradients, which have a very different distribution. Specifically, we find that the neural gradients follow a lognormal distribution. Taking this into account, we suggest two methods to reduce the computational and memory burdens of neural gradients. The first one is stochastic gradient pruning, which can accurately set the sparsity level -- up to 85% gradient sparsity without hurting validation accuracy (ResNet18 on ImageNet). The second method determines the floating-point format for low numerical precision gradients (e.g., FP8). Our results shed light on previous findings related to local scaling, the optimal bit-allocation for the mantissa and exponent, and challenging workloads for which low-precision floating-point arithmetic has reported to fail. Reference implementation accompanies the paper.

* Fix references typos

Via

Access Paper or Ask Questions