Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Ran

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

May 07, 2025

Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang(+64 more)

Abstract:Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.

Via

Access Paper or Ask Questions

Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution

Dec 11, 2023

Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong

Abstract:Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup table (LUT)-based SR schemes that employ simple LUT readout and largely elude CNN computation. Nonetheless, the multi-megabyte LUTs in existing methods still prohibit on-chip storage and necessitate off-chip memory transport. This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache. Utilizing an asymmetric two-branch multistage network coupled with a suite of specialized kernel patterns, HKLUT demonstrates an uncompromising performance and superior hardware efficiency over existing LUT schemes.

Via

Access Paper or Ask Questions

Lite it fly: An All-Deformable-Butterfly Network

Nov 14, 2023

Rui Lin, Jason Chun Lok Li, Jiajun Zhou, Binxiao Huang, Jie Ran, Ngai Wong

Figure 1 for Lite it fly: An All-Deformable-Butterfly Network

Figure 2 for Lite it fly: An All-Deformable-Butterfly Network

Figure 3 for Lite it fly: An All-Deformable-Butterfly Network

Figure 4 for Lite it fly: An All-Deformable-Butterfly Network

Abstract:Most deep neural networks (DNNs) consist fundamentally of convolutional and/or fully connected layers, wherein the linear transform can be cast as the product between a filter matrix and a data matrix obtained by arranging feature tensors into columns. The lately proposed deformable butterfly (DeBut) decomposes the filter matrix into generalized, butterflylike factors, thus achieving network compression orthogonal to the traditional ways of pruning or low-rank decomposition. This work reveals an intimate link between DeBut and a systematic hierarchy of depthwise and pointwise convolutions, which explains the empirically good performance of DeBut layers. By developing an automated DeBut chain generator, we show for the first time the viability of homogenizing a DNN into all DeBut layers, thus achieving an extreme sparsity and compression. Various examples and hardware benchmarks verify the advantages of All-DeBut networks. In particular, we show it is possible to compress a PointNet to < 5% parameters with < 5% accuracy drop, a record not achievable by other compression schemes.

* 7 pages, 3 figures, accepted as a brief paper in IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

Via

Access Paper or Ask Questions

PECAN: A Product-Quantized Content Addressable Memory Network

Aug 13, 2022

Jie Ran, Rui Lin, Jason Chun Lok Li, Jiajun Zhou, Ngai Wong

Figure 1 for PECAN: A Product-Quantized Content Addressable Memory Network

Figure 2 for PECAN: A Product-Quantized Content Addressable Memory Network

Figure 3 for PECAN: A Product-Quantized Content Addressable Memory Network

Figure 4 for PECAN: A Product-Quantized Content Addressable Memory Network

Abstract:A novel deep neural network (DNN) architecture is proposed wherein the filtering and linear transform are realized solely with product quantization (PQ). This results in a natural implementation via content addressable memory (CAM), which transcends regular DNN layer operations and requires only simple table lookup. Two schemes are developed for the end-to-end PQ prototype training, namely, through angle- and distance-based similarities, which differ in their multiplicative and additive natures with different complexity-accuracy tradeoffs. Even more, the distance-based scheme constitutes a truly multiplier-free DNN solution. Experiments confirm the feasibility of such Product-Quantized Content Addressable Memory Network (PECAN), which has strong implication on hardware-efficient deployments especially for in-memory computing.

Via

Access Paper or Ask Questions

Deformable Butterfly: A Highly Structured and Sparse Linear Transform

Mar 25, 2022

Rui Lin, Jie Ran, King Hung Chiu, Graziano Chesi, Ngai Wong

Figure 1 for Deformable Butterfly: A Highly Structured and Sparse Linear Transform

Figure 2 for Deformable Butterfly: A Highly Structured and Sparse Linear Transform

Figure 3 for Deformable Butterfly: A Highly Structured and Sparse Linear Transform

Figure 4 for Deformable Butterfly: A Highly Structured and Sparse Linear Transform

Abstract:We introduce a new kind of linear transform named Deformable Butterfly (DeBut) that generalizes the conventional butterfly matrices and can be adapted to various input-output dimensions. It inherits the fine-to-coarse-grained learnable hierarchy of traditional butterflies and when deployed to neural networks, the prominent structures and sparsity in a DeBut layer constitutes a new way for network compression. We apply DeBut as a drop-in replacement of standard fully connected and convolutional layers, and demonstrate its superiority in homogenizing a neural network and rendering it favorable properties such as light weight and low inference complexity, without compromising accuracy. The natural complexity-accuracy tradeoff arising from the myriad deformations of a DeBut layer also opens up new rooms for analytical and practical research. The codes and Appendix are publicly available at: https://github.com/ruilin0212/DeBut.

Via

Access Paper or Ask Questions

EZCrop: Energy-Zoned Channels for Robust Output Pruning

May 11, 2021

Rui Lin, Jie Ran, Dongpeng Wang, King Hung Chiu, Ngai Wong

Figure 1 for EZCrop: Energy-Zoned Channels for Robust Output Pruning

Figure 2 for EZCrop: Energy-Zoned Channels for Robust Output Pruning

Figure 3 for EZCrop: Energy-Zoned Channels for Robust Output Pruning

Figure 4 for EZCrop: Energy-Zoned Channels for Robust Output Pruning

Abstract:Recent results have revealed an interesting observation in a trained convolutional neural network (CNN), namely, the rank of a feature map channel matrix remains surprisingly constant despite the input images. This has led to an effective rank-based channel pruning algorithm, yet the constant rank phenomenon remains mysterious and unexplained. This work aims at demystifying and interpreting such rank behavior from a frequency-domain perspective, which as a bonus suggests an extremely efficient Fast Fourier Transform (FFT)-based metric for measuring channel importance without explicitly computing its rank. We achieve remarkable CNN channel pruning based on this analytically sound and computationally efficient metric and adopt it for repetitive pruning to demonstrate robustness via our scheme named Energy-Zoned Channels for Robust Output Pruning (EZCrop), which shows consistently better results than other state-of-the-art channel pruning methods.

Via

Access Paper or Ask Questions

Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

May 10, 2021

Jie Ran, Rui Lin, Hayden K. H. So, Graziano Chesi, Ngai Wong

Figure 1 for Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Figure 2 for Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Figure 3 for Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Figure 4 for Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Abstract:Elasticities in depth, width, kernel size and resolution have been explored in compressing deep neural networks (DNNs). Recognizing that the kernels in a convolutional neural network (CNN) are 4-way tensors, we further exploit a new elasticity dimension along the input-output channels. Specifically, a novel nuclear-norm rank minimization factorization (NRMF) approach is proposed to dynamically and globally search for the reduced tensor ranks during training. Correlation between tensor ranks across multiple layers is revealed, and a graceful tradeoff between model size and accuracy is obtained. Experiments then show the superiority of NRMF over the previous non-elastic variational Bayesian matrix factorization (VBMF) scheme.

* 8 pages, 5 figures

Via

Access Paper or Ask Questions