Abstract: Multi-object tracking (MOT) is one of the most important problems in computer vision and a key component of any vision-based perception system used in advanced autonomous mobile robotics. Therefore, its implementation on low-power, real-time embedded platforms is highly desirable. Modern MOT algorithms should be able to track objects of a given class (e.g. people or vehicles). In addition, the number of objects to be tracked is not known in advance, and they may appear and disappear at any time, as well as become occluded. For these reasons, the most popular and successful recent approaches are based on the tracking-by-detection paradigm. A high-quality object detector is therefore essential, and in practice it accounts for the vast majority of the computational and memory complexity of the whole MOT system. In this paper, we propose an FPGA (Field-Programmable Gate Array) implementation of an embedded MOT system based on a quantized YOLOv8 detector and the SORT (Simple Online and Realtime Tracking) tracker. We use a modified version of the FINN framework to utilize external memory for model parameters and to support the operations required by YOLOv8. We discuss the evaluation of detection and tracking performance using the COCO and MOT15 datasets, where we achieve 0.21 mAP and 38.9 MOTA, respectively. As the computational platform, we use an MPSoC system (a Zynq UltraScale+ device from AMD/Xilinx), where the detector is deployed in the reprogrammable logic and the tracking algorithm is implemented in the processor system.
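To make the tracking-by-detection loop concrete, below is a minimal, illustrative sketch of the frame-to-frame association step at the heart of SORT-style trackers: detections are matched to existing tracks by solving a Hungarian assignment over an IoU cost matrix. The full SORT algorithm additionally runs a constant-velocity Kalman filter per track, which is omitted here; all function names are our own.

```python
# Minimal sketch of SORT-style track-to-detection association (Kalman
# prediction omitted). Boxes are (x1, y1, x2, y2) tuples in lists.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match predicted track boxes to new detections (Hungarian algorithm)."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if cost[r, c] <= 1.0 - iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```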
Abstract: The use of unmanned aerial vehicles (UAVs) for smart agriculture is becoming increasingly popular, as evidenced by recent scientific works and the various competitions organised on this topic. In this work, we present a system for automatic fruit counting using UAVs. To detect the fruit, our solution uses a vision algorithm that processes streams from an RGB camera and a depth sensor using classical image operations. Our system also allows the planning and execution of flight trajectories, minimising flight time and distance covered. We tested the proposed solution in simulation and obtained an average score of 87.27/100 points over a total of 500 missions. We also entered it in the UAV Competition organised as part of the ICUAS 2024 conference, where we achieved an average score of 84.83/100 points, placing 6th in a field of 23 teams and advancing to the finals.
Abstract: As the quantities of data recorded by embedded edge sensors grow, so too does the need for intelligent local processing. Such data often comes in the form of time-series signals, from which real-time predictions can be made locally using an AI model. However, this requires a hardware-software approach capable of making low-latency predictions with low power consumption. In this paper, we present a hardware implementation of an event-graph neural network for time-series classification. We leverage an artificial cochlea model to convert the input time-series signals into a sparse event-data format, which allows the event graph to drastically reduce the number of calculations relative to other AI methods. We implemented the design on a SoC FPGA and applied it to the real-time processing of the Spiking Heidelberg Digits (SHD) dataset to benchmark our approach against competitive solutions. Our base model achieves a floating-point accuracy of 92.7% on the SHD dataset, only 2.4% and 2% below the two state-of-the-art models, while using over 10% and 67% fewer model parameters, respectively. It also outperforms FPGA-based spiking neural network implementations by 19.3% and 4.5%, achieving 92.3% accuracy for the quantised model while using fewer computational resources and reducing latency.
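For illustration, one simple way to turn a sparse event stream into a graph is to treat each event as a node and connect events that are close in time and cochlea channel. The radii and node features below are illustrative assumptions, not the exact construction used in the paper.

```python
# Sketch: build a graph from sparse (time, channel) audio events.
# Nodes carry (t, channel) features; edges link spatio-temporal neighbours.
import numpy as np

def build_event_graph(times, channels, r_time=0.01, r_chan=3):
    """times: (N,) timestamps in seconds, sorted ascending;
    channels: (N,) cochlea channel ids. Radii are illustrative."""
    nodes = np.stack([times, channels], axis=1).astype(np.float32)
    edges = []
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            if times[j] - times[i] > r_time:   # times are sorted: stop early
                break
            if abs(channels[j] - channels[i]) <= r_chan:
                edges.append((i, j))
                edges.append((j, i))           # store undirected edge both ways
    return nodes, np.asarray(edges, dtype=np.int64)
```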
Abstract: This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of constraints that shape the algorithm's design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and a fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at https://github.com/vision-agh/lift.
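As a point of reference for the INT8 constraint, the snippet below sketches generic symmetric per-tensor INT8 quantization; LiFT's input-feature quantization scheme is a contribution of the paper itself and may differ from this illustration.

```python
# Generic symmetric per-tensor INT8 quantization sketch (not LiFT's scheme).
import numpy as np

def quantize_int8(x, scale=None):
    """q = clip(round(x / s), -128, 127); scale derived from max |x| if absent."""
    if scale is None:
        scale = float(np.max(np.abs(x))) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(x)
print("max abs error:", np.max(np.abs(dequantize(q, s) - x)))  # bounded by ~s/2
```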
Abstract: Deep reinforcement learning (DRL) is currently the most popular AI-based approach to autonomous vehicle control. An agent trained for this purpose in simulation can interact with the real environment with human-level performance. Despite very good results in terms of selected metrics, this approach has significant drawbacks: high computational requirements and low explainability. Because of that, a DRL-based agent cannot be used in some control tasks, especially when safety is the key issue. Therefore, we propose to use Tangled Program Graphs (TPGs) as an alternative to deep reinforcement learning in control-related tasks. In this approach, input signals are processed by simple programs that are combined in a graph structure. As a result, TPGs are less computationally demanding, and their actions can be explained based on the graph structure. In this paper, we present our studies on the use of TPGs in such tasks. In particular, we consider the problem of navigating an unmanned aerial vehicle (UAV) through an unknown environment based solely on the on-board LiDAR sensor. The results of our work show promising prospects for the use of TPGs in control-related tasks.
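To give an intuition for how a TPG produces an action, the sketch below shows a standard inference pass: each team's learners bid on the current state, the highest bidder wins, and its action is either an atomic action or a pointer to another team, followed until an atomic action is reached. The linear bid programs and all names here are simplifying assumptions; real TPGs evolve register-based programs and the team structure.

```python
# Toy TPG inference sketch: teams of bidding learners over a graph.
import numpy as np

class Learner:
    def __init__(self, weights, action):
        self.weights = weights        # a toy linear "bid" program
        self.action = action          # an int (atomic action) or a Team

    def bid(self, state):
        return float(self.weights @ state)

class Team:
    def __init__(self, learners):
        self.learners = learners

    def act(self, state, visited=None):
        if visited is None:
            visited = set()
        visited.add(id(self))
        # Highest-bidding learner wins; team pointers are followed until an
        # atomic action is reached, skipping already-visited teams (cycles).
        for learner in sorted(self.learners, key=lambda l: l.bid(state),
                              reverse=True):
            if isinstance(learner.action, Team):
                if id(learner.action) in visited:
                    continue
                return learner.action.act(state, visited)
            return learner.action
        raise RuntimeError("no atomic action reachable")

rng = np.random.default_rng(0)
leaf = Team([Learner(rng.normal(size=4), a) for a in range(3)])
root = Team([Learner(rng.normal(size=4), leaf), Learner(rng.normal(size=4), 2)])
print(root.act(rng.normal(size=4)))   # prints the selected atomic action
```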
Abstract: Nowadays, the increasing demand for maintaining high cleanliness standards in public spaces is driving the search for innovative solutions. The deployment of CCTV systems equipped with modern cameras and software enables not only real-time monitoring of the cleanliness status but also automatic detection of impurities and optimisation of cleaning schedules. Digital Twin technology allows for the creation of a virtual model of the space, facilitating the simulation, training, and testing of cleanliness management strategies before implementation in the real world. In this paper, we present the use of advanced vision surveillance systems and Digital Twin technology in cleanliness management, using a railway station as an example. The Digital Twin was created based on an actual 3D model in the Nvidia Omniverse Isaac Sim simulator. We implemented a litter detector, a bin occupancy level detector, stain segmentation, and a human detector (covering the cleaning crew as well), along with analysis of their movement. A preliminary assessment was conducted, and potential modifications for further enhancement and future development of the system were identified.
Abstract: Event cameras are becoming increasingly popular as an alternative to traditional frame-based vision sensors, especially in mobile robotics. Taking full advantage of their high temporal resolution, high dynamic range, low power consumption and the sparsity of event data, which reflects only changes in the observed scene, requires both an efficient algorithm and a specialised hardware platform. A recent trend involves using Graph Convolutional Neural Networks (GCNNs) implemented on a heterogeneous SoC FPGA. In this paper, we focus on optimising hardware modules for graph convolution to allow flexible selection of the FPGA resources (BlockRAM, DSP and LUT) used for their implementation. We propose a ''two-step convolution'' approach that utilises additional BRAM buffers to reduce LUT usage for multiplications by up to 94%. This method significantly improves the scalability of GCNNs, enabling the deployment of models with more layers and larger graph sizes, and their application to more dynamic scenarios.
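As a software analogue of the idea, the sketch below factorises a graph convolution into neighbour aggregation (additions into a buffer, which maps naturally to BRAM) followed by a single weight transform per node, so multiplications scale with the number of nodes rather than edges. The actual hardware scheduling in the paper differs; this is only meant to convey why buffering can reduce multiplier (LUT/DSP) demand.

```python
# Two-step graph convolution sketch: aggregate first, multiply once per node.
import numpy as np

def graph_conv_two_step(x, edges, W):
    """x: (N, C_in) node features; edges: (E, 2) rows of [src, dst];
    W: (C_in, C_out) shared weight matrix."""
    agg = np.zeros_like(x)        # step 1: neighbour aggregation, additions only
    for src, dst in edges:
        agg[dst] += x[src]        # in hardware this buffer would live in BRAM
    return agg @ W                # step 2: one C_in x C_out multiply per node

x = np.random.randn(4, 16).astype(np.float32)
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
W = np.random.randn(16, 32).astype(np.float32)
y = graph_conv_two_step(x, edges, W)   # shape (4, 32)
```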
Abstract: In recent years, there has been growing interest in event cameras, i.e. vision sensors that record changes in illumination independently for each pixel. This type of operation ensures that acquisition is possible in very adverse lighting conditions, both in low light and high dynamic range, and reduces average power consumption. In addition, the independent operation of each pixel results in low latency, which is desirable for robotic solutions. Nowadays, Field Programmable Gate Arrays (FPGAs), along with general-purpose processors (GPPs/CPUs) and programmable graphics processing units (GPUs), are popular architectures for implementing and accelerating computing tasks. In particular, their usefulness in the embedded vision domain has been repeatedly demonstrated over the past 30 years, where they have enabled fast (even real-time) data processing and high energy efficiency. Hence, the combination of event cameras and reconfigurable devices seems to be a good solution, especially in the context of energy-efficient real-time embedded systems. This paper gives an overview of the most important works in which FPGAs have been used in different contexts to process event data. It covers applications in the following areas: filtering, stereovision, optical flow, acceleration of AI-based algorithms (including spiking neural networks) for object classification, detection and tracking, and applications in robotics and inspection systems. Current trends and challenges for such systems are also discussed.
Abstract: The performance of object detection systems in automotive solutions must be as high as possible, with minimal response time and, due to the often battery-powered operation, low energy consumption. When designing such solutions, we therefore face challenges typical of embedded vision systems: fitting algorithms of high memory and computational complexity into small low-power devices. In this paper, we propose PowerYOLO, a mixed-precision solution that targets three essential elements of such an application. First, we propose a system based on a Dynamic Vision Sensor (DVS), a novel sensor that offers low power requirements and operates well in conditions with variable illumination. It is these features that may make event cameras a preferential choice over frame cameras in some applications. Second, to ensure high accuracy with low memory and computational complexity, we propose to use 4-bit Power-of-Two (PoT) quantisation for the convolution weights of the YOLO detector, with all other parameters quantised linearly. Finally, building on the PoT scheme, we replace multiplication with bit-shifting to increase the efficiency of hardware acceleration, together with a special convolution-batch normalisation fusion scheme. The use of a specific sensor with PoT quantisation and the special batch normalisation fusion leads to a unique system with an almost 8x reduction in memory complexity and vast computational simplifications relative to a standard approach. This efficient system achieves a high accuracy of 0.301 mAP on the GEN1 DVS dataset, setting a new state of the art for such a compressed model.
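To illustrate why PoT quantisation removes multiplications, the sketch below rounds weights to signed powers of two and realises each "multiply" as an arithmetic right shift on integer activations. The 4-bit code packing and the convolution-batch-norm fusion from the paper are not reproduced here.

```python
# Power-of-Two (PoT) weight quantisation sketch: multiply becomes a shift.
import numpy as np

def quantize_pot(w, n_exp_levels=8):
    """Round each weight to a signed power of two, sign * 2^(-e),
    choosing e as the nearest exponent in log space."""
    sign = np.sign(w).astype(np.int32)
    e = np.clip(np.round(-np.log2(np.abs(w) + 1e-12)),
                0, n_exp_levels - 1).astype(np.int32)
    return sign, e

def shift_multiply(x_int, sign, e):
    """Compute x * (sign * 2^-e) on integer activations via a right shift."""
    return sign * (x_int >> e)

x = np.array([96, -48, 80], dtype=np.int32)          # example integer activations
sign, e = quantize_pot(np.array([0.5, -0.25, 0.125]))
print(shift_multiply(x, sign, e))                    # -> [48 12 10]
```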
Abstract: The utilisation of event cameras represents an important and swiftly evolving trend aimed at addressing the constraints of traditional video systems. Particularly within the automotive domain, these cameras are highly relevant for integration into embedded real-time systems due to their lower latency and energy consumption. One effective approach to ensuring the necessary throughput and latency of event processing systems is to use graph convolutional networks (GCNs). In this study, we introduce a series of hardware-aware optimisations tailored for PointNet++, a GCN architecture designed for point cloud processing. The proposed techniques result in more than a 100-fold reduction in model size compared to Asynchronous Event-based GNN (AEGNN), one of the most recent works in the field, with a relatively small decrease in accuracy (2.3% for N-Caltech101 classification, 1.7% for N-Cars classification), thus following the TinyML trend. Building on this software research, we designed a custom EFGCN (Event-Based FPGA-accelerated Graph Convolutional Network) and implemented it on the ZCU104 SoC FPGA platform, achieving a throughput of 13.3 million events per second (MEPS) and real-time, partially asynchronous processing with a latency of 4.47 ms. We also address the scalability of the proposed hardware model as a route to improving the accuracy score. To the best of our knowledge, this study marks the first endeavour to accelerate PointNet++ networks on SoC FPGAs, as well as the first exploration of hardware architectures for graph convolutional networks for real-time, continuous event-data processing. We publish both the software and hardware source code in an open repository: https://github.com/vision-agh/*** (will be published upon acceptance).