Ahmet Inci

QUIDAM: A Framework for Quantization-Aware DNN Accelerator and Model Co-Exploration

Jun 30, 2022

Ahmet Inci, Siri Garudanagiri Virupaksha, Aman Jain, Ting-Wu Chin, Venkata Vivek Thallam, Ruizhou Ding, Diana Marculescu

Abstract: As the machine learning and systems communities strive to achieve higher energy efficiency through custom deep neural network (DNN) accelerators, varied precision or quantization levels, and model compression techniques, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QUIDAM, a highly parameterized quantization-aware DNN accelerator and model co-exploration framework. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, number of total processing elements, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy vary by more than 5x and 35x, respectively. With the proposed framework, we show that lightweight processing elements achieve on-par accuracy and up to 5.7x more performance per area and energy improvement when compared to the best INT16-based implementation. Finally, due to the efficiency of the pre-characterized power, performance, and area models, QUIDAM can speed up the design exploration process by 3-4 orders of magnitude as it removes the need for expensive synthesis and characterization of each design.
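
To make the idea of sweeping a parameterized design space concrete, here is a minimal Python sketch. It is not QUIDAM's code or API: the knob names are borrowed from the design choices listed in the abstract, while the knob values, `enumerate_designs`, and the cost model `toy_perf_per_area` are hypothetical placeholders chosen only so the example runs.

```python
from itertools import product

# Hypothetical design knobs, named after the choices listed in the abstract.
# The values are illustrative placeholders, not taken from the paper.
DESIGN_SPACE = {
    "bit_precision": [4, 8, 16],
    "pe_type": ["int", "light"],
    "num_pes": [64, 128, 256],
    "global_buffer_kb": [64, 128],
}

def enumerate_designs(space):
    """Yield every combination of knob values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def toy_perf_per_area(design):
    """Toy stand-in for a pre-characterized model (NOT the paper's model).

    Area grows with PE count, bit width, and buffer size; "light" PEs are
    assumed to halve the area. Throughput is assumed proportional to PE count.
    """
    area = design["num_pes"] * design["bit_precision"] + design["global_buffer_kb"]
    if design["pe_type"] == "light":
        area *= 0.5
    return design["num_pes"] / area

designs = list(enumerate_designs(DESIGN_SPACE))
best = max(designs, key=toy_perf_per_area)
print(len(designs), "design points; best under the toy model:", best)
```

Because each design point is scored by a cheap function call instead of a synthesis run, the whole sweep finishes almost instantly, which is the kind of speed-up that pre-characterized models are meant to provide.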

* 25 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:2205.13045, arXiv:2205.08648

Efficient Deep Learning Using Non-Volatile Memory Technology

Jun 27, 2022

Ahmet Inci, Mehmet Meric Isgenc, Diana Marculescu

Abstract: Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become equally important for training ML models. With this comes the challenge of overall efficient deployment, in particular low power and high throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as STT-MRAM and SOT-MRAM have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While prior work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2x and 2.4x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
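
For readers unfamiliar with the metric, energy-delay product is simply energy multiplied by delay, and an "EDP reduction" is the ratio of the baseline EDP to the candidate EDP. The Python sketch below illustrates that arithmetic only; the function names and the energy and delay numbers are hypothetical placeholders, not measurements from DeepNVM++.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds

def edp_reduction(baseline, candidate):
    """How many times smaller the candidate's EDP is than the baseline's."""
    return edp(*baseline) / edp(*candidate)

# Placeholder (energy, delay) pairs for illustration only; not from the paper.
sram_cache = (2.0, 1.0)
nvm_cache = (0.5, 1.0)

ratio = edp_reduction(sram_cache, nvm_cache)
print(f"EDP reduction vs. baseline: {ratio:.1f}x")
```

A ratio above 1 means the candidate cache has the lower EDP; the same helper works for iso-capacity or iso-area comparisons as long as both pairs come from the same workload.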

* This article will appear as a book chapter in the book titled "Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing". arXiv admin note: substantial text overlap with arXiv:2012.04559

QADAM: Quantization-Aware DNN Accelerator Modeling for Pareto-Optimality

May 20, 2022

Ahmet Inci, Siri Garudanagiri Virupaksha, Aman Jain, Venkata Vivek Thallam, Ruizhou Ding, Diana Marculescu

Abstract: As the machine learning and systems communities strive to achieve higher energy efficiency through custom deep neural network (DNN) accelerators and varied bit precision or quantization levels, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements (PEs) into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QADAM, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration and Pareto-efficiency of DNN accelerators for various design choices such as bit precision, PE type, scratchpad sizes of PEs, global buffer size, number of total PEs, and DNN configurations. Our results show that different bit precisions and PE types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy vary by more than 5x and 35x, respectively. We also show that the proposed lightweight processing elements (LightPEs) consistently achieve Pareto-optimal results in terms of accuracy and hardware-efficiency. With the proposed framework, we show that LightPEs achieve on-par accuracy and up to 5.7x more performance per area and energy improvement when compared to the best INT16-based design.
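
As a rough illustration of what Pareto-optimality means here, the Python sketch below keeps only the design points that no other point beats on both accuracy and performance per area. It is a generic dominance filter, not QADAM's implementation; `pareto_front`, the field names, and the candidate values are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Both objectives (accuracy, perf_per_area) are maximized. A point q
    dominates p if q is at least as good on both and strictly better on one.
    """
    front = []
    for p in points:
        dominated = any(
            q["accuracy"] >= p["accuracy"]
            and q["perf_per_area"] >= p["perf_per_area"]
            and (q["accuracy"] > p["accuracy"]
                 or q["perf_per_area"] > p["perf_per_area"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Illustrative design points only; the numbers are not from the paper.
candidates = [
    {"name": "A", "accuracy": 0.90, "perf_per_area": 1.0},
    {"name": "B", "accuracy": 0.89, "perf_per_area": 3.0},
    {"name": "C", "accuracy": 0.85, "perf_per_area": 2.0},
    {"name": "D", "accuracy": 0.80, "perf_per_area": 4.0},
]

front_names = [p["name"] for p in pareto_front(candidates)]
print(front_names)
```

In this toy set, point C drops out because B is better on both axes, while A, B, and D each represent a different accuracy versus efficiency trade-off. The pairwise check is quadratic in the number of points, which is fine for a sketch but may need a sort-based method for large sweeps.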

* Accepted paper at the Machine Learning for Computer Architecture and Systems (MLArchSys) Workshop in conjunction with ISCA 2021. This is an extended version of arXiv:2205.08648

QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators

May 17, 2022

Ahmet Inci, Siri Garudanagiri Virupaksha, Aman Jain, Venkata Vivek Thallam, Ruizhou Ding, Diana Marculescu

Abstract: As the machine learning and systems community strives to achieve higher energy efficiency through custom DNN accelerators and model compression techniques, there is a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our proposed lightweight processing elements achieve up to 4.9x more performance per area and energy improvement when compared to an INT16-based implementation.

* Accepted paper at the On-Device Intelligence Workshop in conjunction with MLSys Conference 2021

DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning

Dec 08, 2020

Ahmet Inci, Mehmet Meric Isgenc, Diana Marculescu

Abstract: Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.

* 12 pages, 10 figures

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Dec 08, 2020

Ahmet Inci, Evgeny Bolotin, Yaosheng Fu, Gal Dalal, Shie Mannor, David Nellans, Diana Marculescu

Abstract: With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efficiency of RL training through understanding the architectural implications of CPU-GPU systems remains an open problem. In this work, we investigate and improve the performance and power efficiency of distributed RL training on CPU-GPU systems by approaching the problem not solely from the GPU microarchitecture perspective but through a holistic system-level analysis. We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework and empirically identify the bottlenecks caused by GPU microarchitectural, algorithmic, and system-level design choices. We show that the GPU microarchitecture itself is well-balanced for state-of-the-art RL frameworks, but further investigation reveals that the number of actors running the environment interactions and the amount of hardware resources available to them are the primary performance and power efficiency limiters. To this end, we introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources when designing scalable and efficient CPU-GPU systems for RL training.

* To appear in the proceedings of the 6th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) 2020
