Liyan Zheng

Optimal Kernel Orchestration for Tensor Programs with Korch

Jun 13, 2024

Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
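
The idea of choosing kernels through binary decision variables can be illustrated with a toy sketch. This is not Korch's actual formulation or code: the primitive names, candidate kernels, and latencies below are all made up for illustration, dependency constraints are ignored, and a brute-force search over 0/1 vectors stands in for the binary linear programming solver the abstract mentions.

```python
from itertools import product

# Hypothetical primitives obtained by splitting one operator (e.g. a softmax).
primitives = ["exp", "reduce_sum", "div"]

# Hypothetical candidate kernels: (primitives the kernel executes, assumed latency).
candidates = [
    ({"exp"}, 4.0),
    ({"reduce_sum"}, 5.0),
    ({"div"}, 4.0),
    ({"exp", "reduce_sum"}, 6.0),
    ({"reduce_sum", "div"}, 7.0),
    ({"exp", "reduce_sum", "div"}, 10.5),
]

def best_orchestration(primitives, candidates):
    """Pick a 0/1 value per candidate kernel so that every primitive is
    executed by at least one selected kernel and total latency is minimal."""
    required = set(primitives)
    best_cost, best_choice = float("inf"), None
    # Enumerate every binary assignment; a real solver would avoid this.
    for choice in product((0, 1), repeat=len(candidates)):
        covered, cost = set(), 0.0
        for selected, (prims, latency) in zip(choice, candidates):
            if selected:
                covered |= prims
                cost += latency
        if covered >= required and cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice

cost, choice = best_orchestration(primitives, candidates)
print(cost, choice)
```

With these assumed latencies, fusing everything into one kernel is not the cheapest option, which is the kind of case a greedy fusion strategy could miss.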

* Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769
* Fix some typos in the ASPLOS version

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Jul 11, 2023

Zixuan Ma, Haojie Wang, Jingze Xing, Liyan Zheng, Chen Zhang, Huanqi Cao, Kezhao Huang, Shizhi Tang, Penghan Wang, Jidong Zhai

Abstract: Deep neural networks (DNNs) are of critical use in different domains. To accelerate DNN computation, tensor compilers are proposed to generate efficient code on different domain-specific accelerators. Existing tensor compilers mainly focus on optimizing computation efficiency. However, memory access is becoming a key performance bottleneck because the computational performance of accelerators is increasing much faster than memory performance. The lack of a direct description of memory access and data dependence in current tensor compilers' intermediate representation (IR) makes it significantly harder to generate memory-efficient code. In this paper, we propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators by considering both computation and data movement optimizations. IntelliGen represents a DNN program using GIR, which includes primitives indicating its computation, data movement, and parallel strategies. This information is further composed into an instruction-level dataflow graph to perform holistic optimizations by searching different memory access patterns and computation operations, and generating memory-efficient code on different hardware. We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively, compared to the most performant current frameworks.

* 12 pages, 14 figures

OLLIE: Derivation-based Tensor Program Optimizer

Aug 02, 2022

Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shizhi Tang, Lei Xie, Kezhao Huang, Zhihao Jia

Abstract: Boosting the runtime performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world tasks. Existing approaches to optimizing the tensor algebra expression of a DNN only consider expressions representable by a fixed set of predefined operators, missing possible optimization opportunities between general expressions. We propose OLLIE, the first derivation-based tensor program optimizer. OLLIE optimizes tensor programs by leveraging transformations between general tensor algebra expressions, enabling a significantly larger expression search space that includes those supported by prior work as special cases. OLLIE uses a hybrid derivation-based optimizer that effectively combines explorative and guided derivations to quickly discover highly optimized expressions. Evaluation on seven DNNs shows that OLLIE can outperform existing optimizers by up to 2.73x (1.46x on average) on an A100 GPU and up to 2.68x (1.51x) on a V100 GPU, respectively.
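
As a rough illustration of what a transformation between equivalent tensor algebra expressions looks like, the sketch below rewrites A·B + A·C into A·(B + C) and checks that both forms give the same result. This is a textbook distributivity identity chosen for illustration, not a derivation rule taken from OLLIE; the helper names and matrices are assumptions.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Small integer matrices so the comparison is exact.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]

# Original expression: two matrix products followed by an addition.
original = add(matmul(A, B), matmul(A, C))

# Rewritten expression: one addition followed by a single matrix product.
derived = matmul(A, add(B, C))

print(original == derived)
```

The rewritten form performs one matrix product instead of two, which hints at why searching over equivalent expressions can uncover cheaper programs.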
