Keisuke Sugiura

InstantFT: An FPGA-Based Runtime Subsecond Fine-tuning of CNN Models

Jun 06, 2025

Keisuke Sugiura, Hiroki Matsutani

Abstract: Training deep neural networks (DNNs) requires significantly more computation and memory than inference, making runtime adaptation of DNNs challenging on resource-limited IoT platforms. We propose InstantFT, an FPGA-based method for ultra-fast CNN fine-tuning on IoT devices that optimizes the forward and backward computations in parameter-efficient fine-tuning (PEFT). Experiments on datasets with concept drift demonstrate that InstantFT fine-tunes a pre-trained CNN 17.4x faster than existing Low-Rank Adaptation (LoRA)-based approaches, while achieving comparable accuracy. Our FPGA-based InstantFT reduces the fine-tuning time to just 0.36s and improves energy efficiency by 16.3x, enabling on-the-fly adaptation of CNNs to non-stationary data distributions.

ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization

Jan 08, 2025

Keisuke Sugiura, Hiroki Matsutani

Abstract: Zeroth-order (ZO) optimization is being recognized as a simple yet powerful alternative to standard backpropagation (BP)-based training. Notably, ZO optimization allows for training with only forward passes and (almost) the same memory as inference, making it well-suited for edge devices with limited computing and memory resources. In this paper, we propose ZO-based on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO lies in the middle between pure ZO- and pure BP-based approaches, and is based on the idea of employing BP for the last few layers and ZO for the remaining layers. ElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first time, by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values. Experimental results on the classification datasets show that ElasticZO effectively addresses the slow convergence of vanilla ZO and shrinks the accuracy gap to BP-based training. Compared to vanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead, and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 further reduces the memory usage and training time by 1.46-1.60x and 1.38-1.42x, respectively, without compromising the accuracy. These results demonstrate a better tradeoff between accuracy and training cost compared to pure ZO- and BP-based approaches, and also highlight the potential of ZO optimization in on-device learning.
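As a rough illustration of the forward-pass-only training that the abstract describes, the sketch below implements a generic two-point zeroth-order gradient estimate and uses it to minimize a toy quadratic loss. It is not the ElasticZO method itself: the hybrid BP part and the INT8 variant are not shown, and the names `zo_gradient`, `quadratic_loss`, and `train`, as well as the values of `eps`, `lr`, and `steps`, are assumptions made for this example.

```python
import random


def zo_gradient(loss_fn, params, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate using only loss evaluations."""
    rng = rng or random.Random(0)
    # Random Gaussian perturbation direction, one entry per parameter.
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Directional derivative along z, estimated from two forward passes.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def quadratic_loss(params):
    """Toy loss (an assumption for this sketch) with its minimum at all ones."""
    return sum((p - 1.0) ** 2 for p in params)


def train(steps=2000, lr=0.01, seed=0):
    """Plain ZO-SGD: no backward pass and no stored activations."""
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = zo_gradient(quadratic_loss, params, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Each step needs only two loss evaluations and one random vector, which is why the memory footprint stays close to that of inference.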

FPGA-Accelerated Correspondence-free Point Cloud Registration with PointNet Features

Apr 01, 2024

Keisuke Sugiura, Hiroki Matsutani

Abstract: Point cloud registration serves as a basis for vision and robotic applications including 3D reconstruction and mapping. Despite significant improvements in the quality of results, recent deep learning approaches are computationally expensive and power-hungry, making them difficult to deploy on resource-constrained edge devices. To tackle this problem, we propose a fast, accurate, and robust registration method for low-cost embedded FPGAs. Based on a parallel and pipelined PointNet feature extractor, we develop custom accelerator cores, namely PointLKCore and ReAgentCore, for two different learning-based methods. Both are correspondence-free and computationally efficient, as they avoid the costly feature matching step involving nearest-neighbor search. The proposed cores are implemented on the Xilinx ZCU104 board and evaluated using both synthetic and real-world datasets, showing substantial improvements in the trade-offs between runtime and registration quality. They run 44.08-45.75x faster than ARM Cortex-A53 CPU and offer 1.98-11.13x speedups over Intel Xeon CPU and Nvidia Jetson boards, while consuming less than 1W and achieving 163.11-213.58x higher energy efficiency than Nvidia GeForce GPU. The proposed cores are more robust to noise and large initial misalignments than the classical methods and quickly find reasonable solutions in less than 15ms, demonstrating real-time performance.

* 27 pages, 19 figures

A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE

Jan 05, 2024

Ikumi Okubo, Keisuke Sugiura, Hiroki Matsutani

Abstract: Transformer is an emerging neural network model with an attention mechanism. It has been adopted for various tasks and has achieved favorable accuracy compared to CNNs and RNNs. While the attention mechanism is recognized as a general-purpose component, many Transformer models require a significantly larger number of parameters than CNN-based ones. To mitigate the computational complexity, a hybrid approach has recently been proposed, which uses ResNet as a backbone architecture and replaces a part of its convolution layers with an MHSA (Multi-Head Self-Attention) mechanism. In this paper, we significantly reduce the parameter size of such models by using Neural ODE (Ordinary Differential Equation) as a backbone architecture instead of ResNet. The proposed hybrid model reduces the parameter size by 94.6% compared to the CNN-based ones without degrading the accuracy. We then deploy the proposed model on a modest-sized FPGA device for edge computing. To further reduce FPGA resource utilization, we quantize the model following a QAT (Quantization Aware Training) scheme instead of PTQ (Post Training Quantization) to suppress the accuracy loss. As a result, an extremely lightweight Transformer-based model can be implemented on resource-limited FPGAs. The weights of the feature extraction network are stored on-chip to minimize the memory transfer overhead, which accelerates inference. The proposed FPGA implementation achieves a 12.8x speedup and 9.21x better energy efficiency compared to ARM Cortex-A53 CPU.

An FPGA-Based Accelerator for Graph Embedding using Sequential Training Algorithm

Dec 23, 2023

Kazuki Sunaga, Keisuke Sugiura, Hiroki Matsutani

Abstract: Graph embedding is an emerging approach that represents a graph structure with fixed-length low-dimensional vectors. node2vec is a well-known algorithm that obtains such a graph embedding by sampling neighboring nodes on a given graph with a random walk technique. However, the original node2vec algorithm typically relies on batch training of graph structures; thus, it is not suited for applications in which the graph structure changes after deployment. In this paper, we focus on node2vec applications for IoT (Internet of Things) environments. To handle changes in graph structures after the IoT devices have been deployed in edge environments, we propose to combine an online sequential training algorithm with node2vec. The proposed sequentially-trainable model is implemented on a resource-limited FPGA (Field-Programmable Gate Array) device to demonstrate the benefits of our approach. The proposed FPGA implementation achieves up to 205.25 times speedup compared to the original model on CPU. Evaluation results using dynamic graphs show that, while the accuracy of the original model decreases, the proposed sequential model obtains a better graph embedding that increases the accuracy even when the graph structure changes.

An Integrated FPGA Accelerator for Deep Learning-based 2D/3D Path Planning

Jun 30, 2023

Keisuke Sugiura, Hiroki Matsutani

Abstract: Path planning is a crucial component for realizing the autonomy of mobile robots. However, due to limited computational resources on mobile robots, it remains challenging to deploy state-of-the-art methods and achieve real-time performance. To address this, we propose P3Net (PointNet-based Path Planning Networks), a lightweight deep-learning-based method for 2D/3D path planning, and design an IP core (P3NetCore) targeting FPGA SoCs (Xilinx ZCU104). P3Net improves the algorithm and model architecture of the recently-proposed MPNet. P3Net employs an encoder with a PointNet backbone and a lightweight planning network in order to extract robust point cloud features and sample path points from a promising region. P3NetCore comprises a fully-pipelined point cloud encoder, a batched bidirectional path planner, and a parallel collision checker, covering most parts of the algorithm. On the 2D (3D) datasets, P3Net with the IP core runs 24.54-149.57x and 6.19-115.25x (10.03-59.47x and 3.38-28.76x) faster than ARM Cortex CPU and Nvidia Jetson, respectively, while only consuming 0.255W (0.809W), and is up to 1049.42x (133.84x) more power-efficient than the workstation. P3Net improves the success rate by up to 28.2% and plans a near-optimal path, leading to a significantly better tradeoff between computation and solution quality than MPNet and the state-of-the-art sampling-based methods.

* 25 pages, 17 figures

An Efficient Accelerator for Deep Learning-based Point Cloud Registration on FPGAs

Mar 11, 2022

Keisuke Sugiura, Hiroki Matsutani

Abstract: Point cloud registration is the basis for many robotic applications such as odometry and Simultaneous Localization And Mapping (SLAM), which are increasingly important for autonomous mobile robots. Computational resources and power budgets are limited on these robots, motivating the development of resource-efficient registration methods on low-cost FPGAs. In this paper, we propose a novel approach for FPGA-based 3D point cloud registration built upon a recent deep learning-based method, PointNetLK. A highly-efficient FPGA accelerator for PointNet-based feature extraction is designed and implemented on both low-cost and mid-range FPGAs (Avnet Ultra96v2 and Xilinx ZCU104). Our accelerator design is evaluated in terms of registration speed, accuracy, resource usage, and power consumption. Experimental results show that PointNetLK with our accelerator achieves up to 21.34x and 69.60x faster registration speed than the CPU counterpart and ICP, respectively, while only consuming 722mW and maintaining the same level of accuracy.

* 6 pages, 11 figures

A Low-Cost Neural ODE with Depthwise Separable Convolution for Edge Domain Adaptation on FPGAs

Jul 27, 2021

Hiroki Kawakami, Hirohisa Watanabe, Keisuke Sugiura, Hiroki Matsutani

Abstract: Although high-performance deep neural networks are in high demand in edge environments, computation resources are strictly limited in edge devices, so lightweight neural network techniques, such as Depthwise Separable Convolution (DSC), have been developed. ResNet is a conventional deep neural network model that stacks many layers and parameters for higher accuracy. To reduce the parameter size of ResNet, Neural ODE exploits a similarity to ODE (Ordinary Differential Equation) and reuses most of its weight parameters instead of holding many different parameters. Thus, a Neural ODE model is significantly smaller than ResNet and can be implemented on resource-limited edge devices. In this paper, a combination of Neural ODE and DSC, called dsODENet, is designed and implemented for FPGAs (Field-Programmable Gate Arrays). dsODENet is then applied to edge domain adaptation as a practical use case and evaluated with image classification datasets. It is implemented on a Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, training speed, FPGA resource utilization, and speedup rate compared to a software execution. The results demonstrate that dsODENet is comparable to or slightly better than our baseline Neural ODE implementation in terms of domain adaptation accuracy, while the total parameter size without pre- and post-processing layers is reduced by 54.2% to 79.8%. The FPGA implementation accelerates the prediction tasks by 27.9 times compared to a software implementation.
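To make the parameter saving of depthwise separable convolution concrete, the sketch below compares weight counts for a standard convolution and its depthwise-plus-pointwise replacement. The formulas are the standard ones for DSC with biases ignored; the layer sizes (3x3 kernel, 64 input and output channels) and the function names are assumptions chosen for illustration and are not taken from dsODENet, so the resulting ratio should not be compared directly with the 54.2% to 79.8% model-level figures above.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k filter per (in, out) channel pair."""
    return kernel * kernel * c_in * c_out


def dsc_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(kernel, c_in, c_out):
    """Fraction of weights removed by switching one layer to DSC."""
    dense = standard_conv_params(kernel, c_in, c_out)
    return 1.0 - dsc_params(kernel, c_in, c_out) / dense


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 -> 64 channels.
    print(standard_conv_params(3, 64, 64))  # 36864
    print(dsc_params(3, 64, 64))            # 4672
    print(round(reduction_ratio(3, 64, 64), 4))
```

The saving grows with the number of output channels and the kernel size, since the k x k filtering is no longer repeated for every output channel.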

Particle Filter-based vs. Graph-based: SLAM Acceleration on Low-end FPGAs

Mar 17, 2021

Keisuke Sugiura, Hiroki Matsutani

Abstract: SLAM allows a robot to continuously perceive the surrounding environment and locate itself correctly. However, its high computational complexity limits the practical use of SLAM on resource-constrained computing platforms. We propose a resource-efficient FPGA-based accelerator and apply it to two major SLAM methods: particle filter-based and graph-based SLAM. We compare their performance in terms of latency, throughput gain, and memory consumption, considering their algorithmic characteristics, and confirm that the accelerator removes the bottleneck without compromising the accuracy in both methods.

An FPGA Acceleration and Optimization Techniques for 2D LiDAR SLAM Algorithm

May 29, 2020

Keisuke Sugiura, Hiroki Matsutani

Abstract: An efficient hardware design of Simultaneous Localization and Mapping (SLAM) methods is necessary for mobile autonomous robots with limited computational resources. In this paper, we develop a resource-efficient FPGA design for accelerating the scan matching process, which is typically the bottleneck in 2D LiDAR SLAM methods. Scan matching is a process of correcting a robot pose by aligning the latest LiDAR measurements with an occupancy grid map, which encodes the information about the surrounding environment. The proposed design exploits the inherent parallelism in Rao-Blackwellized Particle Filter (RBPF)-based algorithms to perform scan matching computations for multiple particles in parallel. In the design, a map compression technique and a lookup table are employed to reduce the resource utilization and achieve the maximum throughput. Simulation results using the benchmark datasets show that the scan matching is accelerated by 23.3-51.1x and the overall throughput is improved by 1.97-3.16x without seriously degrading the quality of the final outputs. Furthermore, our implementation requires only 37% of the total resources available in the Xilinx ZCU104 evaluation board, thus providing a feasible solution to realize SLAM applications on indoor mobile robots.
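The scan matching step described above can be pictured with a small sequential sketch: a candidate pose is scored by how many LiDAR endpoints land in occupied grid cells, and the pose is corrected by picking the best-scoring neighbor. This is a simplified hit-count scheme assumed for illustration, not the scoring function, map compression, or parallel per-particle design of the paper; `transform`, `score`, and `refine_pose` are hypothetical names.

```python
import math


def transform(points, pose):
    """Map scan points from the robot frame into the world frame."""
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in points]


def score(occupied, points, pose, resolution=1.0):
    """Count scan endpoints that fall into occupied cells of the grid map."""
    hits = 0
    for wx, wy in transform(points, pose):
        cell = (math.floor(wx / resolution), math.floor(wy / resolution))
        if cell in occupied:
            hits += 1
    return hits


def refine_pose(occupied, points, initial, step=1.0):
    """Try the initial pose and its 8 translated neighbors; keep the best one."""
    best, best_score = initial, score(occupied, points, initial)
    for dx in (-step, 0.0, step):
        for dy in (-step, 0.0, step):
            cand = (initial[0] + dx, initial[1] + dy, initial[2])
            cand_score = score(occupied, points, cand)
            if cand_score > best_score:
                best, best_score = cand, cand_score
    return best, best_score


if __name__ == "__main__":
    wall = {(5, 0), (5, 1), (5, 2)}                # occupied cells (toy map)
    scan = [(4.5, 0.5), (4.5, 1.5), (4.5, 2.5)]    # endpoints in robot frame
    print(refine_pose(wall, scan, (0.0, 0.0, 0.0)))
```

In a particle filter, a routine like `refine_pose` runs once per particle with no dependence between particles, which is the kind of parallelism the abstract refers to.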
