
Haihao Shen

Efficient LLM Inference on CPUs

Nov 01, 2023

TEQ: Trainable Equivalent Transformation for Quantization of LLMs

Oct 17, 2023

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Sep 28, 2023

Efficient Post-training Quantization with FP8 Formats

Sep 26, 2023

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Jun 28, 2023

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Oct 31, 2022

Fast DistilBERT on CPUs

Oct 27, 2022

Prune Once for All: Sparse Pre-Trained Language Models

Nov 10, 2021

Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe

May 04, 2018