Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xi Jin

Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs

Mar 08, 2025

Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan(+2 more)

Abstract:General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing PE microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines

Via

Access Paper or Ask Questions

Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Sep 07, 2022

Yuchen Gui, Boyi Wei, Wei Yuan, Xi Jin

Figure 1 for Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Figure 2 for Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Figure 3 for Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Figure 4 for Hardware Acceleration of Sampling Algorithms in Sample and Aggregate Graph Neural Networks

Abstract:Sampling is an important process in many GNN structures in order to train larger datasets with a smaller computational complexity. However, compared to other processes in GNN (such as aggregate, backward propagation), the sampling process still costs tremendous time, which limits the speed of training. To reduce the time of sampling, hardware acceleration is an ideal choice. However, state of the art GNN acceleration proposal did not specify how to accelerate the sampling process. What's more, directly accelerating traditional sampling algorithms will make the structure of the accelerator very complicated. In this work, we made two contributions: (1) Proposed a new neighbor sampler: CONCAT Sampler, which can be easily accelerated on hardware level while guaranteeing the test accuracy. (2) Designed a CONCAT-sampler-accelerator based on FPGA, with which the neighbor sampling process boosted to about 300-1000 times faster compared to the sampling process without it.

Via

Access Paper or Ask Questions

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Jan 14, 2021

Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin

Figure 1 for BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Figure 2 for BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Figure 3 for BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Figure 4 for BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

Abstract:The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism has been widely recognized. We propose a new pipeline parallelism training framework, BaPipe, which can automatically explore pipeline parallelism training methods and balanced partition strategies for DNN distributed training. In BaPipe, each accelerator calculates the forward propagation and backward propagation of different parts of networks to implement the intra-batch pipeline parallelism strategy. BaPipe uses a new load balancing automatic exploration strategy that considers the parameters of DNN models and the computation, memory, and communication resources of accelerator clusters. We have trained different DNNs such as VGG-16, ResNet-50, and GNMT on GPU clusters and simulated the performance of different FPGA clusters. Compared with state-of-the-art data parallelism and pipeline parallelism frameworks, BaPipe provides up to 3.2x speedup and 4x memory reduction in various platforms.

Via

Access Paper or Ask Questions

A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing

Jan 04, 2019

Tong Geng, Tianqi Wang, Ang Li, Xi Jin, Martin Herbordt

Figure 1 for A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing

Figure 2 for A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing

Figure 3 for A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing

Figure 4 for A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing

Abstract:Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we propose a framework called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the Alexnet, VGG-16, and VGG-19 benchmarks. Experimental results show that FPDeep has good scalability to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 6 transceivers per FPGA, FPDeep shows linearity up to 83 FPGAs. Energy efficiency is evaluated with respect to GOPs/J. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.

* 20 pages, 20 figures

Via

Access Paper or Ask Questions