Abstract: We present ongoing work on a new automatic code generation approach for supporting quantized generative inference for LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
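As a rough illustration of the kind of kernel such a code generator targets, below is a minimal NumPy sketch of a group-quantized (4-bit) matrix-vector product. The layout, group size, and function name are assumptions chosen for exposition; this is not QIGen's generated code.

```python
import numpy as np

def quantized_matvec(qweight, scales, zeros, x, group_size=64):
    """Reference (un-optimized) group-quantized matrix-vector product.

    qweight : (rows, cols) uint8 array of 4-bit codes in [0, 15]
    scales  : (rows, cols // group_size) per-group scales
    zeros   : (rows, cols // group_size) per-group zero points
    x       : (cols,) input activations
    """
    rows, cols = qweight.shape
    y = np.zeros(rows)
    for g in range(cols // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        # Dequantize one group on the fly: w = scale * (code - zero)
        w = scales[:, g:g + 1] * (qweight[:, sl].astype(np.float32) - zeros[:, g:g + 1])
        y += w @ x[sl]
    return y
```

A real generated kernel would typically pack two 4-bit codes per byte, tile the loops for cache locality, and vectorize the dequantize-and-multiply step; the sketch unpacks everything explicitly for readability.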
Abstract: We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to-end runtime experiments, both in transfer learning using already-sparsified networks and in training sparse networks from scratch. Thus, our results provide the first support for sparse training on commodity hardware.
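To illustrate the core idea, here is a minimal sketch of the backward pass of a linear layer whose weight matrix is sparse, where the gradient work scales with the number of nonzeros rather than the dense layer size. It uses SciPy's CSR format for clarity; it is not the paper's vectorized CPU implementation, and all names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_linear_backward(W_csr, x, grad_y):
    """Backward pass of y = W @ x for a sparse weight matrix W in CSR form.

    Returns the input gradient and the weight gradients restricted to the
    nonzero positions of W, so cost is O(nnz(W)) instead of O(rows * cols).
    """
    # Gradient w.r.t. the input: dL/dx = W^T @ dL/dy
    grad_x = W_csr.T @ grad_y

    # Gradient w.r.t. the nonzero weights: dL/dW[i, j] = dL/dy[i] * x[j],
    # evaluated only where W is nonzero (aligned with W_csr.data).
    rows = np.repeat(np.arange(W_csr.shape[0]), np.diff(W_csr.indptr))
    cols = W_csr.indices
    grad_w_values = grad_y[rows] * x[cols]
    return grad_x, grad_w_values

# Example usage: a 1024x1024 layer at roughly 95% sparsity.
rng = np.random.default_rng(0)
mask = rng.random((1024, 1024)) > 0.95
W = csr_matrix(rng.standard_normal((1024, 1024)) * mask)
grad_x, grad_w = sparse_linear_backward(W, rng.standard_normal(1024),
                                        rng.standard_normal(1024))
```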