Abstract: Many state-of-the-art deep learning models for computer vision are based on the transformer architecture. These models can be computationally expensive and are typically configured statically for a fixed deployment scenario. In real-time applications, however, the resources available for each inference can vary considerably and may fall below what state-of-the-art models require. Dynamic models can adapt model execution to meet real-time resource constraints. Prior dynamic-inference work has primarily minimized resource utilization for less complex input images while maintaining accuracy, and has focused on CNNs and early transformer models such as BERT; in contrast, we adapt vision transformers to meet the system's dynamic resource constraints, independent of the input image. We find that, unlike early transformer models, recent state-of-the-art vision transformers rely heavily on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, enabling us to create a low-overhead system for dynamic real-time inference without additional training. Finally, we design an optimized accelerator for these dynamic vision transformers in a 5nm technology. The PE array occupies 2.26mm$^2$ and is 17 times faster than an NVIDIA TITAN V GPU on state-of-the-art transformer-based models for semantic segmentation.
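To make the layer-skipping idea concrete, below is a minimal PyTorch sketch of one way to skip pretrained blocks at inference time under a resource budget. The wrapper class, the `active` flag, and the keep-first-blocks policy are illustrative assumptions, not the paper's implementation.

```python
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Wraps a pretrained block; skipping it acts as an identity (assumes
    the wrapped block computes a full residual update, x + f(x))."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.active = True  # toggled at runtime by a resource scheduler

    def forward(self, x):
        # When inactive, return the input unchanged; the abstract's
        # observation is that pretrained models tolerate skipped updates.
        return self.block(x) if self.active else x

def set_budget(blocks, keep_ratio: float):
    """Illustrative policy: keep only the first fraction of blocks active."""
    n_keep = max(1, int(len(blocks) * keep_ratio))
    for i, b in enumerate(blocks):
        b.active = i < n_keep
```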
Abstract: Real-time CNN-based object detection models for applications such as surveillance can achieve high accuracy but are computationally expensive. Recent work has shown a 10 to 100x reduction in inference computation cost by using domain-specific networks. However, prior work has focused on inference only: if the domain model requires frequent retraining, training costs can become a significant bottleneck. To address this, we propose Dataset Culling, a pipeline that reduces the size of the training dataset based on prediction difficulty. Images that are easy to classify are filtered out, since they contribute little to improving accuracy. Difficulty is measured with our proposed confidence loss metric at little computational overhead. Dataset Culling is extended to optimize the image resolution, further improving training and inference costs. We develop fixed-angle, long-duration video datasets across several domains and show that Dataset Culling can reduce training costs by 47x with no accuracy loss, or even a slight improvement. Code is available at https://github.com/kentaroy47/DatasetCulling
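As a hedged sketch of what a difficulty-based culling pipeline could look like, the snippet below scores images and keeps only the hardest fraction. The entropy-style proxy used here, the `predictor` interface, and the `keep_fraction` default are assumptions for illustration; the paper's actual confidence loss metric is defined in the released code.

```python
import torch

def confidence_loss(scores: torch.Tensor) -> float:
    """Difficulty proxy (assumed): binary entropy summed over detections.
    High values mean predictions near the decision boundary, i.e. a hard image."""
    p = scores.clamp(1e-6, 1 - 1e-6)
    return float(-(p * p.log() + (1 - p) * (1 - p).log()).sum())

def cull(dataset, predictor, keep_fraction=0.02):
    """Keep only the hardest fraction of images (illustrative pipeline).
    `predictor(img)` is assumed to return per-detection confidences."""
    scored = [(confidence_loss(predictor(img)), img) for img in dataset]
    scored.sort(key=lambda t: t[0], reverse=True)  # hardest first
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [img for _, img in scored[:n_keep]]
```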
Abstract: We propose an end-to-end framework for training domain-specific models (DSMs) that achieve both high accuracy and computational efficiency on object detection tasks. DSMs are trained with distillation \cite{hinton2015distilling} and focus on achieving high accuracy in a limited domain (e.g., a fixed view of an intersection). We argue that DSMs can capture the essential features well even with a small model size, enabling higher accuracy and efficiency than traditional techniques. In addition, we improve training efficiency by reducing the dataset size, culling easy-to-classify images from the training set. For the limited domain, we observe that compact DSMs significantly surpass the accuracy of COCO-trained models of the same size. By training on a compact dataset, we show that training time can be reduced by 93\% with an accuracy drop of only 3.6\%. Code is available at https://github.com/kentaroy47/training-domain-specific-models
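The distillation setup can be sketched as follows, assuming the standard soft-target formulation of \cite{hinton2015distilling} applied to logits; the paper targets detection, so per-box matching details are omitted here and the function and variable names are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target KL divergence at temperature T (Hinton et al., 2015)."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

def train_step(student, teacher, images, optimizer):
    """One distillation step: the compact DSM mimics the large teacher."""
    with torch.no_grad():
        t_logits = teacher(images)   # large, general-purpose model
    s_logits = student(images)       # compact domain-specific model
    loss = distillation_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```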
Abstract: Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many computer vision problems, and many researchers have explored optimized implementations. Most implementations heuristically block the computation to deal with the large data sizes and high data reuse of CNNs. This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests. Using this model, we automatically derive optimized blockings for common networks that improve the energy efficiency of custom hardware implementations by up to an order of magnitude. Compared to traditional CNN CPU implementations based on highly tuned, hand-optimized BLAS libraries, our x86 programs implementing the optimal blocking reduce the number of memory accesses by up to 90%.
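A simplified version of such an analytical model can be expressed in a few lines: estimate the main-memory traffic implied by a candidate blocking and search over block sizes whose working set fits the on-chip buffer. The traffic formula below is a common, simplified reuse argument, not the paper's exact model, and all parameter names are assumptions.

```python
import math
from itertools import product

def dram_traffic(C, K, H, W, R, S, bc, bk, bh, bw):
    """Estimated word traffic for one blocked conv layer: each operand is
    re-fetched once per iteration of the block loops it is not reused across."""
    in_sz  = C * (H + R - 1) * (W + S - 1)   # input activations
    wt_sz  = K * C * R * S                   # filter weights
    out_sz = K * H * W                       # output activations / partial sums
    return (in_sz  * math.ceil(K / bk)                      # refetched per output-channel block
          + wt_sz  * math.ceil(H / bh) * math.ceil(W / bw)  # refetched per spatial block
          + out_sz * math.ceil(C / bc))                     # refetched per input-channel block

def best_blocking(C, K, H, W, R, S, capacity, candidates=(1, 2, 4, 8, 16, 32)):
    """Exhaustively search block sizes whose working set fits the buffer."""
    best = None
    for bc, bk, bh, bw in product(candidates, repeat=4):
        working = (bc * (bh + R - 1) * (bw + S - 1)  # input block
                 + bk * bc * R * S                   # weight block
                 + bk * bh * bw)                     # output block
        if working > capacity:
            continue
        t = dram_traffic(C, K, H, W, R, S, bc, bk, bh, bw)
        if best is None or t < best[0]:
            best = (t, (bc, bk, bh, bw))
    return best
```

Minimizing such a cost function over legal blockings is what lets the derived x86 programs cut memory accesses relative to hand-tuned BLAS baselines.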