Naveen Suda

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Nov 18, 2024

Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel (+10 more)

Abstract: This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).

Towards Open-World Gesture Recognition

Jan 20, 2024

Junxiao Shen, Matthias De Lange, Xuhai "Orson" Xu, Enmin Zhou, Ran Tan, Naveen Suda, Maciej Lazarewicz, Per Ola Kristensson, Amy Karlson, Evan Strasnick

Abstract: Static machine learning methods in gesture recognition assume that training and test data come from the same underlying distribution. However, in real-world applications involving gesture recognition on wrist-worn devices, data distribution may change over time. We formulate this problem of adapting recognition models to new tasks, where new data patterns emerge, as open-world gesture recognition (OWGR). We propose leveraging continual learning to make machine learning models adaptive to new tasks without degrading performance on previously learned tasks. However, the exploration of parameters for questions around when and how to train and deploy recognition models requires time-consuming user studies and is sometimes impractical. To address this challenge, we propose a design engineering approach that enables offline analysis on a collected large-scale dataset with various parameters and compares different continual learning methods. Finally, design guidelines are provided to enhance the development of an open-world wrist-worn gesture recognition process.

Collapsible Linear Blocks for Super-Efficient Super Resolution

Mar 17, 2021

Kartikeya Bhardwaj, Milos Milosavljevic, Alex Chalfin, Naveen Suda, Liam O'Neil, Dibakar Gope, Lingchuan Meng, Ramon Matas, Danny Loh

Abstract: With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose SESR, a new class of Super-Efficient Super Resolution networks that significantly improve image quality and reduce computational complexity. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 SISR (1080p to 8K). Towards this, we simulate hardware performance numbers for a commercial mobile Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster than existing models. Overall, SESR establishes a new Pareto frontier on the quality (PSNR)-computation relationship for the super resolution task.

EdgeAI: A Vision for Deep Learning in IoT Era

Oct 23, 2019

Kartikeya Bhardwaj, Naveen Suda, Radu Marculescu

Abstract: The significant computational requirements of deep learning present a major bottleneck for its large-scale adoption on hardware-constrained IoT devices. Here, we envision a new paradigm called EdgeAI to address major impediments associated with deploying deep networks at the edge. Specifically, we discuss the existing directions in computation-aware deep learning and describe two new challenges in the IoT era: (1) Data-independent deployment of learning, and (2) Communication-aware distributed inference. We further present new directions from our recent research to alleviate the latter two challenges. Overcoming these challenges is crucial for rapid adoption of learning on IoT devices in order to truly enable EdgeAI.

* To appear in IEEE Design and Test

Dream Distillation: A Data-Independent Model Compression Framework

May 17, 2019

Kartikeya Bhardwaj, Naveen Suda, Radu Marculescu

Abstract: Model compression is eminently suited for deploying deep learning on IoT devices. However, existing model compression techniques rely on access to the original or some alternate dataset. In this paper, we address the model compression problem when no real data is available, e.g., when data is private. To this end, we propose Dream Distillation, a data-independent model compression framework. Our experiments show that Dream Distillation can achieve 88.5% accuracy on the CIFAR-10 test set without actually training on the original data!

* Presented at the ICML 2019 Joint Workshop on On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR)

Rethinking Machine Learning Development and Deployment for Edge Devices

Jun 20, 2018

Liangzhen Lai, Naveen Suda

Abstract: Machine learning (ML), especially deep learning, is made possible by the availability of big data, enormous compute power, and the often-overlooked development tools and frameworks. As the algorithms mature and become more efficient, more and more ML inference is moving out of datacenters and the cloud and onto edge devices. This model deployment process can be challenging, as the deployment environment and requirements can be substantially different from those during model development. In this paper, we propose a new ML development and deployment approach that is specially designed and optimized for inference-only deployment on edge devices. We build a prototype and demonstrate that this approach can address all the deployment challenges and result in more efficient and high-quality solutions.

Federated Learning with Non-IID Data

Jun 02, 2018

Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, Vikas Chandra

Abstract: Federated learning enables resource-constrained edge compute devices, such as mobile phones and IoT devices, to learn a shared model for prediction, while keeping the training data local. This decentralized approach to training models provides privacy, security, regulatory and economic benefits. In this work, we focus on the statistical challenge of federated learning when local data is non-IID. We first show that the accuracy of federated learning drops significantly, by up to 55%, for neural networks trained on highly skewed non-IID data, where each client device trains only on a single class of data. We further show that this accuracy reduction can be explained by the weight divergence, which can be quantified by the earth mover's distance (EMD) between the distribution over classes on each device and the population distribution. As a solution, we propose a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices. Experiments show that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
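
The skew measure mentioned in the abstract can be illustrated with a small sketch. It assumes, for illustration only, that the EMD between two categorical class distributions is taken as the sum of absolute differences between class probabilities; the function names and the toy label lists are hypothetical and not taken from the paper.

```python
from collections import Counter


def class_distribution(labels, num_classes):
    """Return the empirical probability of each class in `labels`."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def emd_to_population(device_labels, population_labels, num_classes):
    """Skew of one device relative to the population.

    Assumption: distance = sum over classes of |p_device - p_population|.
    """
    p_dev = class_distribution(device_labels, num_classes)
    p_pop = class_distribution(population_labels, num_classes)
    return sum(abs(a - b) for a, b in zip(p_dev, p_pop))


def add_shared_subset(device_labels, shared_labels):
    """Mix a globally shared subset into a device's local data."""
    return list(device_labels) + list(shared_labels)


if __name__ == "__main__":
    # Hypothetical toy data: a uniform 10-class population and a
    # device that holds only class 3.
    population = [c for c in range(10) for _ in range(10)]
    device = [3] * 50
    print("single-class device:", emd_to_population(device, population, 10))

    # Adding one shared example per class reduces the measured skew.
    mixed = add_shared_subset(device, list(range(10)))
    print("with shared subset: ", emd_to_population(mixed, population, 10))
```

In this toy setup the distance shrinks once shared examples are mixed in, which is the direction of the effect the abstract describes; the sketch says nothing about the resulting accuracy.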

Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

May 30, 2018

Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, Hadi Esmaeilzadeh

Abstract: Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that consists of an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, Bit Fusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6x speedup and 3.9x energy reduction at the 45 nm node when Bit Fusion area and frequency are set to those of Stripes. Scaling to the GPU technology node of 16 nm, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while consuming only 895 milliwatts of power.
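
The idea of fusing small bit-level units to match a layer's bitwidth can be sketched in software. The example below is an arithmetic illustration only, not the paper's hardware design: it assumes unsigned operands and a hypothetical 2-bit unit size, and shows that a wide multiply equals the shifted sum of narrow partial products.

```python
def split_into_bricks(value, bitwidth, brick_bits=2):
    """Split an unsigned integer into little-endian chunks of `brick_bits` bits."""
    if not 0 <= value < (1 << bitwidth):
        raise ValueError("value does not fit in the given bitwidth")
    mask = (1 << brick_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bitwidth, brick_bits)]


def fused_multiply(a, b, a_bits, b_bits, brick_bits=2):
    """Multiply by summing shifted partial products of small chunks.

    Each (pa * pb) stands in for one narrow multiplier; the shift places
    the partial product at its weight in the full-width result.
    """
    a_parts = split_into_bricks(a, a_bits, brick_bits)
    b_parts = split_into_bricks(b, b_bits, brick_bits)
    total = 0
    for i, pa in enumerate(a_parts):
        for j, pb in enumerate(b_parts):
            total += (pa * pb) << (brick_bits * (i + j))
    return total


def bricks_needed(a_bits, b_bits, brick_bits=2):
    """Number of narrow multipliers one product uses at these bitwidths."""
    ceil_div = lambda n, d: -(-n // d)
    return ceil_div(a_bits, brick_bits) * ceil_div(b_bits, brick_bits)


if __name__ == "__main__":
    a, b = 173, 201
    assert fused_multiply(a, b, 8, 8) == a * b
    for bits in (2, 4, 8):
        print(f"{bits}-bit x {bits}-bit uses {bricks_needed(bits, bits)} unit(s)")
```

Under these assumptions an 8-bit by 8-bit product occupies 16 units while a 2-bit by 2-bit product occupies one, so a fixed pool of units can serve more multiplications per step when a layer tolerates lower precision.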

Hello Edge: Keyword Spotting on Microcontrollers

Feb 14, 2018

Yundong Zhang, Naveen Suda, Liangzhen Lai, Vikas Chandra

Abstract: Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with a similar number of parameters.

* Code available on GitHub at https://github.com/ARM-software/ML-KWS-for-MCU
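
As a rough illustration of why depthwise separable convolutions suit tight memory budgets, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart using the usual textbook formulas. The layer shape is hypothetical and biases are ignored; none of the numbers come from the paper or its repository.

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weights in a standard convolution: one k_h x k_w x c_in filter per output channel."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_params(k_h, k_w, c_in, c_out):
    """Weights in a depthwise conv (one k_h x k_w filter per input channel)
    followed by a 1x1 pointwise conv that mixes channels."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 input and 64 output channels.
    std = standard_conv_params(3, 3, 64, 64)
    dsc = depthwise_separable_params(3, 3, 64, 64)
    print("standard conv weights:      ", std)
    print("depthwise separable weights:", dsc)
    print(f"reduction factor: {std / dsc:.2f}x")
```

The same ratio applies to multiply-accumulate counts per output position, since each weight is used once per position in both variants.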

Not All Ops Are Created Equal!

Jan 29, 2018

Liangzhen Lai, Naveen Suda, Vikas Chandra

Abstract: Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this work, we point out that typical design metrics for gauging the efficiency of neural network architectures -- total number of operations and parameters -- are not sufficient. These metrics may not accurately correlate with the actual deployment metrics such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data also needs to be considered, apart from the model parameters, for network architecture exploration studies.

* Accepted at SysML Conference 2018
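
The point about activation memory can be made concrete with a small accounting sketch. It assumes layer-by-layer execution in which a layer's input and output buffers are both live at once, and 1 byte per value; the `Layer` type and the layer sizes are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    params: int      # number of weights
    in_elems: int    # number of input activation values
    out_elems: int   # number of output activation values


def weight_bytes(layers, bytes_per_value=1):
    """Total storage for model parameters."""
    return sum(layer.params for layer in layers) * bytes_per_value


def peak_activation_bytes(layers, bytes_per_value=1):
    """Largest input-plus-output buffer needed by any single layer.

    Assumption: only the current layer's input and output are held
    in memory at the same time.
    """
    return max(layer.in_elems + layer.out_elems for layer in layers) * bytes_per_value


if __name__ == "__main__":
    # Hypothetical three-layer network.
    net = [
        Layer("conv1", params=640, in_elems=490, out_elems=8000),
        Layer("conv2", params=36_864, in_elems=8000, out_elems=8000),
        Layer("fc", params=96_000, in_elems=8000, out_elems=12),
    ]
    print("weights (bytes):         ", weight_bytes(net))
    print("peak activations (bytes):", peak_activation_bytes(net))
```

Two networks with the same parameter count can differ widely in the second number, which is why a parameter budget alone may not predict whether a model fits on a given device.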
