Huizi Mao

VILA: On Pre-training for Visual Language Models

Dec 14, 2023

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

Abstract:Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but the field lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

May 26, 2022

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han

Abstract:Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.

* The first two authors contributed equally to this work. Project page: https://bevfusion.mit.edu

PatchNet -- Short-range Template Matching for Efficient Video Processing

Mar 10, 2021

Huizi Mao, Sibo Zhu, Song Han, William J. Dally

Abstract:Object recognition is a fundamental problem in many video processing tasks; accurately locating seen objects at low computation cost paves the way for on-device video recognition. We propose PatchNet, an efficient convolutional neural network to match objects in adjacent video frames. It learns patchwise correlation features instead of pixel features. PatchNet is very compact, running at just 58 MFLOPs, 5x simpler than MobileNetV2. We demonstrate its application on two tasks, video object detection and visual object tracking. On ImageNet VID, PatchNet reduces the FLOPs of R-FCN ResNet-101 by 5x and EfficientDet-D0 by 3.4x with less than 1% mAP loss. On OTB2015, PatchNet reduces SiamFC and SiamRPN by 2.5x with no accuracy loss. Experiments on Jetson Nano further demonstrate 2.8x to 4.3x speed-ups associated with the FLOPs reduction. Code is open sourced at https://github.com/RalphMao/PatchNet.

A Delay Metric for Video Object Detection: What Average Precision Fails to Tell

Aug 18, 2019

Huizi Mao, Xiaodong Yang, William J. Dally

Abstract:Average precision (AP) is a widely used metric to evaluate detection accuracy of image and video object detectors. In this paper, we analyze object detection from videos and point out that AP alone is not sufficient to capture the temporal nature of video object detection. To tackle this problem, we propose a comprehensive metric, average delay (AD), to measure and compare detection delay. To facilitate delay evaluation, we carefully select a subset of ImageNet VID, which we name as ImageNet VIDT with an emphasis on complex trajectories. By extensively evaluating a wide range of detectors on VIDT, we show that most methods drastically increase the detection delay but still preserve AP well. In other words, AP is not sensitive enough to reflect the temporal characteristics of a video object detector. Our results suggest that video object detection methods should be additionally evaluated with a delay metric, particularly for latency-critical applications such as autonomous vehicle perception.

* ICCV 2019

CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video

Sep 30, 2018

Huizi Mao, Taeyoung Kong, William J. Dally

Abstract:Detecting objects in a video is a compute-intensive task. In this paper we propose CaTDet, a system to speed up object detection by leveraging the temporal correlation in video. CaTDet consists of two DNN models that form a cascaded detector, and an additional tracker to predict regions of interest based on historic detections. We also propose a new metric, mean Delay (mD), which is designed for latency-critical video applications. Experiments on the KITTI dataset show that CaTDet reduces operation count by 5.1-8.7x with the same mean Average Precision (mAP) as the single-model Faster R-CNN detector and incurs an additional delay of 0.3 frames. On the CityPersons dataset, CaTDet achieves a 13.0x reduction in operations with 0.8% mAP loss.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Feb 05, 2018

Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally

Abstract:Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.

* ICLR 2018
* we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy
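The abstract does not spell out how the redundant gradient entries are dropped. One common way to exploit that kind of redundancy, assumed here purely for illustration, is to transmit only the largest-magnitude entries and keep the rest in a local residual that is added to the next gradient. The function name `sparsify_gradient` and the `keep_ratio` parameter are hypothetical, and the sketch leaves out the four accuracy-preserving methods the abstract lists (momentum correction, local gradient clipping, momentum factor masking, warm-up training).

```python
def sparsify_gradient(grad, residual, keep_ratio=0.001):
    """Return (entries to transmit as {index: value}, new local residual).

    Illustrative sketch only: plain magnitude-based selection with local
    accumulation, not the full method described in the paper.
    """
    # Add the fresh gradient to whatever was withheld in earlier steps.
    acc = [r + g for r, g in zip(residual, grad)]
    # Always send at least one entry.
    k = max(1, int(len(acc) * keep_ratio))
    # Indices of the k largest-magnitude accumulated entries.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sent = {i: acc[i] for i in top}
    # Sent entries are cleared locally; the others wait for a later step.
    new_residual = [0.0 if i in sent else v for i, v in enumerate(acc)]
    return sent, new_residual


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    sent, residual = sparsify_gradient([0.5, -2.0, 0.25, 0.125], residual, keep_ratio=0.25)
    print(sent, residual)
```

With `keep_ratio=0.001`, roughly one entry in a thousand is transmitted per step, which matches the 99.9% figure in spirit; small entries are delayed rather than discarded because they keep accumulating in the residual.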

Exploring the Regularity of Sparse Structure in Convolutional Neural Networks

Jun 05, 2017

Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally

Abstract:Sparsity helps reduce the computational complexity of deep neural networks by skipping zeros. Taking advantage of sparsity is listed as a high priority in next-generation DNN accelerators such as TPU. The structure of sparsity, i.e., the granularity of pruning, affects the efficiency of hardware accelerator design as well as the prediction accuracy. Coarse-grained pruning creates regular sparsity patterns, making it more amenable to hardware acceleration but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights into how to maintain accuracy while having a more structured sparsity pattern. Our experimental results show that coarse-grained pruning can achieve a sparsity ratio similar to unstructured pruning without loss of accuracy. Moreover, due to the index saving effect, coarse-grained pruning is able to obtain a better compression ratio than fine-grained sparsity at the same accuracy threshold. Based on the recent sparse convolutional neural network accelerator (SCNN), our experiments further demonstrate that coarse-grained sparsity saves about 2x the memory references compared to fine-grained sparsity. Since memory reference is more than two orders of magnitude more expensive than arithmetic operations, the regularity of sparse structure leads to more efficient hardware design.

* submitted to NIPS 2017

Trained Ternary Quantization

Feb 23, 2017

Chenzhuo Zhu, Song Han, Huizi Mao, William J. Dally

Abstract:Deep neural networks are widely used in machine learning applications. However, large neural network models can be difficult to deploy on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models (32, 44, 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet. Our AlexNet model is trained from scratch, which means it is as easy to train as a normal full-precision model. We highlight that our trained quantization method can learn both ternary values and ternary assignment. During inference, only ternary values (2-bit weights) and scaling factors are needed, therefore our models are nearly 16x smaller than full-precision models. Our ternary models can also be viewed as sparse binary weight networks, which can potentially be accelerated with custom circuits. Experiments on CIFAR-10 show that the ternary models obtained by the trained quantization method outperform full-precision models of ResNet-32, 44, 56 by 0.04%, 0.16%, 0.36%, respectively. On ImageNet, our model outperforms the full-precision AlexNet model by 0.3% of Top-1 accuracy and outperforms previous ternary models by 3%.

* Accepted for Poster Presentation on ICLR 2017
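As a rough illustration of the inference-time representation the abstract describes (2-bit ternary codes plus scaling factors), the sketch below maps a list of weights to codes in {-1, 0, +1} and estimates the storage saving. The threshold rule, the function names, and the default values are assumptions made for this example; in the paper the scaling factors are learned during training, whereas here they are simply passed in.

```python
def ternarize(weights, threshold_ratio=0.05, w_pos=1.0, w_neg=1.0):
    """Map weights to ternary codes and their dequantized values.

    Assumed rule for this sketch: entries within threshold_ratio * max|w|
    of zero become 0; the rest become +w_pos or -w_neg by sign.
    Expects a non-empty list of floats.
    """
    limit = threshold_ratio * max(abs(w) for w in weights)
    codes = [1 if w > limit else -1 if w < -limit else 0 for w in weights]
    values = [w_pos if c == 1 else -w_neg if c == -1 else 0.0 for c in codes]
    return codes, values


def compression_ratio(num_weights, full_bits=32, code_bits=2, num_scales=2):
    """Storage ratio of full-precision weights to 2-bit codes plus scales."""
    full = num_weights * full_bits
    packed = num_weights * code_bits + num_scales * full_bits
    return full / packed


if __name__ == "__main__":
    codes, values = ternarize([0.9, -0.02, -0.7, 0.01], w_pos=0.8, w_neg=0.6)
    print(codes, values)
    print(compression_ratio(1_000_000))
```

The ratio approaches 32 / 2 = 16 as the layer grows, and the small overhead of storing the scaling factors is why the figure is "nearly" 16x rather than exactly 16x.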

DSD: Dense-Sparse-Dense Training for Deep Neural Networks

Feb 21, 2017

Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran(+2 more)

Abstract:Modern deep neural networks have a large number of parameters, making them very hard to train. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network. Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top1 accuracy of GoogLeNet by 1.1%, VGG-16 by 4.3%, ResNet-18 by 1.2% and ResNet-50 by 1.1%, respectively. On the WSJ'93 dataset, DSD improved DeepSpeech and DeepSpeech2 WER by 2.0% and 1.1%. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by over 1.7. DSD is easy to use in practice: at training time, DSD incurs only one extra hyper-parameter: the sparsity ratio in the S step. At testing time, DSD doesn't change the network architecture or incur any inference overhead. The consistent and significant performance gain of DSD experiments shows the inadequacy of the current training methods for finding the best local optimum, while DSD effectively achieves superior optimization performance for finding a better solution. DSD models are available to download at https://songhan.github.io/DSD.

* Published as a conference paper at ICLR 2017
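The three steps in the abstract can be sketched on a flat list of weights. This is a minimal illustration under stated assumptions, not the authors' implementation: `sparse_mask`, `apply_mask`, and `dsd` are hypothetical names, and `train` stands in for any user-supplied training routine that takes `(weights, mask)` and keeps masked-out entries at zero when a mask is given.

```python
def sparse_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the smallest-magnitude fraction."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out the pruned weights."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


def dsd(weights, train, sparsity):
    """Dense -> Sparse -> re-Dense flow around a hypothetical train()."""
    # D: ordinary dense training to learn weights and their importance.
    weights = train(weights, None)
    # S: prune small weights, then retrain under the sparsity constraint.
    mask = sparse_mask(weights, sparsity)
    weights = train(apply_mask(weights, mask), mask)
    # re-Dense: drop the constraint; pruned weights restart from zero.
    return train(weights, None)


if __name__ == "__main__":
    # Toy stand-in for training: nudges every unmasked weight by +1.
    def toy_train(w, mask):
        return [x + 1.0 if (mask is None or mask[i]) else 0.0
                for i, x in enumerate(w)]

    print(dsd([0.5, -0.25, 2.0, 0.125], toy_train, sparsity=0.5))
```

The only tunable quantity the sketch adds over plain training is `sparsity`, which mirrors the abstract's point that the sparsity ratio in the S step is the one extra hyper-parameter.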

ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Feb 20, 2017

Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang(+2 more)

Abstract:Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such large models are both computation intensive and memory intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. In order to speed up the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model to each PE for parallelism, and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE), that works directly on the compressed model. Implemented on a Xilinx XCKU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

* Accepted as full paper in FPGA'17, Monterey, CA; Also appeared at 1st International Workshop on Efficient Methods for Deep Neural Networks at NIPS 2016, Barcelona, Spain
