Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiacheng Yang

Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Jan 27, 2025

Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang(+6 more)

Figure 1 for Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Abstract:It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on NVIDIA H800 GPU and 95% on NVIDIA H20 GPU.

* 11 pages

Via

Access Paper or Ask Questions

CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Oct 15, 2024

Chunlei Meng, Jiacheng Yang, Wei Lin, Bowen Liu, Hongda Zhang, chun ouyang, Zhongxue Gan

Figure 1 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 2 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 3 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Figure 4 for CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Abstract:Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76\%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets (fewer than 100,000).

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

Feb 26, 2024

Christina Giannoula, Peiming Yang, Ivan Fernandez Vega, Jiacheng Yang, Yu Xin Li, Juan Gomez Luna, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko

Figure 1 for Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

Figure 2 for Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

Figure 3 for Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

Figure 4 for Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

Abstract:Graph Neural Networks (GNNs) are emerging ML models to analyze graph-structure data. Graph Neural Network (GNN) execution involves both compute-intensive and memory-intensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside to memory arrays. In this work, we introduce PyGim, an efficient ML framework that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively, to match their algorithmic nature. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04x, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim will be open-sourced to enable the widespread use of PIM systems in GNNs.

Via

Access Paper or Ask Questions

LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

Dec 02, 2021

Ningyu Zhang, Hongbin Ye, Jiacheng Yang, Shumin Deng, Chuanqi Tan, Mosha Chen, Songfang Huang, Fei Huang, Huajun Chen

Figure 1 for LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

Figure 2 for LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

Figure 3 for LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

Figure 4 for LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

Abstract:Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes the adoption for real-world applications challenging with limited data. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seeds logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.

Via

Access Paper or Ask Questions

GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Apr 30, 2020

Hanrui Wang, Kuan Wang, Jiacheng Yang, Linxiao Shen, Nan Sun, Hae-Seung Lee, Song Han

Figure 1 for GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Figure 2 for GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Figure 3 for GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Figure 4 for GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Abstract:Automatic transistor sizing is a challenging problem in circuit design due to the large design space, complex performance trade-offs, and fast technological advancements. Although there has been plenty of work on transistor sizing targeting on one circuit, limited research has been done on transferring the knowledge from one circuit to another to reduce the re-design overhead. In this paper, we present GCN-RL Circuit Designer, leveraging reinforcement learning (RL) to transfer the knowledge between different technology nodes and topologies. Moreover, inspired by the simple fact that circuit is a graph, we learn on the circuit topology representation with graph convolutional neural networks (GCN). The GCN-RL agent extracts features of the topology graph whose vertices are transistors, edges are wires. Our learning-based optimization consistently achieves the highest Figures of Merit (FoM) on four different circuits compared with conventional black-box optimization methods (Bayesian Optimization, Evolutionary Algorithms), random search, and human expert designs. Experiments on transfer learning between five technology nodes and two circuit topologies demonstrate that RL with transfer learning can achieve much higher FoMs than methods without knowledge transfer. Our transferable optimization method makes transistor sizing and design porting more effective and efficient.

* Accepted to the 57th Design Automation Conference (DAC 2020); 6 pages, 8 figures

Via

Access Paper or Ask Questions

Towards Making the Most of BERT in Neural Machine Translation

Aug 30, 2019

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, Lei Li

Figure 1 for Towards Making the Most of BERT in Neural Machine Translation

Figure 2 for Towards Making the Most of BERT in Neural Machine Translation

Figure 3 for Towards Making the Most of BERT in Neural Machine Translation

Figure 4 for Towards Making the Most of BERT in Neural Machine Translation

Abstract:GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (\method) that is the key to integrate the pre-trained LMs to neural machine translation (NMT). Our proposed Cnmt consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; \item a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and b)a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show \method gains of up to 3 BLEU score on the WMT14 English-German language pair which even surpasses the previous state-of-the-art pre-training aided NMT by 1.4 BLEU score. While for the large WMT14 English-French task with 40 millions of sentence-pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU score.

Via

Access Paper or Ask Questions

Learning to Design Circuits

Jan 17, 2019

Hanrui Wang, Jiacheng Yang, Hae-Seung Lee, Song Han

Figure 1 for Learning to Design Circuits

Figure 2 for Learning to Design Circuits

Figure 3 for Learning to Design Circuits

Figure 4 for Learning to Design Circuits

Abstract:Analog IC design relies on human experts to search for parameters that satisfy circuit specifications with their experience and intuitions, which is highly labor intensive, time consuming and suboptimal. Machine learning is a promising tool to automate this process. However, supervised learning is difficult for this task due to the low availability of training data: 1) Circuit simulation is slow, thus generating large-scale dataset is time-consuming; 2) Most circuit designs are propitiatory IPs within individual IC companies, making it expensive to collect large-scale datasets. We propose Learning to Design Circuits (L2DC) to leverage reinforcement learning that learns to efficiently generate new circuits data and to optimize circuits. We fix the schematic, and optimize the parameters of the transistors automatically by training an RL agent with no prior knowledge about optimizing circuits. After iteratively getting observations, generating a new set of transistor parameters, getting a reward, and adjusting the model, L2DC is able to optimize circuits. We evaluate L2DC on two transimpedance amplifiers. Trained for a day, our RL agent can achieve comparable or better performance than human experts trained for a quarter. It first learns to meet hard-constraints (eg. gain, bandwidth), and then learns to optimize good-to-have targets (eg. area, power). Compared with grid search-aided human design, L2DC can achieve $\mathbf{250}\boldsymbol{\times}$ higher sample efficiency with comparable performance. Under the same runtime constraint, the performance of L2DC is also better than Bayesian Optimization.

* The first two authors contributed equally to this work

Via

Access Paper or Ask Questions

Path-Level Network Transformation for Efficient Architecture Search

Jun 07, 2018

Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, Yong Yu

Figure 1 for Path-Level Network Transformation for Efficient Architecture Search

Figure 2 for Path-Level Network Transformation for Efficient Architecture Search

Figure 3 for Path-Level Network Transformation for Efficient Architecture Search

Figure 4 for Path-Level Network Transformation for Efficient Architecture Search

Abstract:We introduce a new function-preserving transformation for efficient neural architecture search. This network transformation allows reusing previously trained networks and existing successful architectures that improves sample efficiency. We aim to address the limitation of current network transformation operations that can only perform layer-level architecture modifications, such as adding (pruning) filters or inserting (removing) a layer, which fails to change the topology of connection paths. Our proposed path-level transformation operations enable the meta-controller to modify the path topology of the given network while keeping the merits of reusing weights, and thus allow efficiently designing effective structures with complex path topologies like Inception models. We further propose a bidirectional tree-structured reinforcement learning meta-controller to explore a simple yet highly expressive tree-structured architecture space that can be viewed as a generalization of multi-branch architectures. We experimented on the image classification datasets with limited computational resources (about 200 GPU-hours), where we observed improved parameter efficiency and better test results (97.70% test accuracy on CIFAR-10 with 14.3M parameters and 74.6% top-1 accuracy on ImageNet in the mobile setting), demonstrating the effectiveness and transferability of our designed architectures.

* ICML 2018

Via

Access Paper or Ask Questions

MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence

Dec 02, 2017

Lianmin Zheng, Jiacheng Yang, Han Cai, Weinan Zhang, Jun Wang, Yong Yu

Figure 1 for MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence

Abstract:We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents' optimal polices, but more importantly, the observation and understanding of individual agent's behaviors and social phenomena emerging from the AI society, including communication languages, leaderships, altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch.

* NIPS 2017 & AAAI 2018 Demo

Via

Access Paper or Ask Questions