Abstract:Anonymization plays a key role in protecting sensitive information about individuals in real-world datasets. Self-driving cars, for example, need high-resolution facial features to track people and their viewing direction in order to predict future behaviour and react accordingly. To protect people's privacy while keeping important features in the dataset, it is important to replace the full body of a person with a highly detailed anonymized one. In contrast to face anonymization, full-body replacement reduces the ability to recognize people by their hairstyle or clothes. In this paper, we propose a workflow for full-body person anonymization that uses Stable Diffusion as a generative backend. Text-to-image diffusion models such as Stable Diffusion, OpenAI's DALL-E, or Midjourney have recently become very popular and can create photorealistic images from a single text prompt. We show that our method outperforms state-of-the-art anonymization pipelines with respect to image quality, resolution, Inception Score (IS), and Fréchet Inception Distance (FID). Additionally, our method is invariant with respect to the image generator and can thus be used with the latest models available.
Abstract:Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models that accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived architecture, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
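The MAPE metric named in the abstract above can be illustrated with a minimal sketch. It uses the standard definition of mean absolute percentage error; the function name `mape` and the latency values are hypothetical and are not taken from the paper.

```python
def mape(reference, estimate):
    """Mean absolute percentage error of estimates against reference values, in percent."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    if not reference:
        raise ValueError("at least one sample is required")
    if any(r == 0 for r in reference):
        raise ValueError("MAPE is undefined when a reference value is zero")
    # Average of the relative errors |(r - e) / r|, expressed in percent.
    total = sum(abs((r - e) / r) for r, e in zip(reference, estimate))
    return 100.0 * total / len(reference)


# Hypothetical latencies in cycles (illustrative only, not results from the paper).
simulated = [1000.0, 2000.0, 4000.0]   # e.g. values from a reference simulation
estimated = [1100.0, 1900.0, 4000.0]   # e.g. values from a fast performance model
print(f"MAPE: {mape(simulated, estimated):.2f} %")
```

Here the relative errors are 10 %, 5 %, and 0 %, so the mean is roughly 5 %. Because each error is normalized by the reference value, short and long layers contribute equally to the score.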
Abstract:The growing concerns regarding energy consumption and privacy have prompted the development of AI solutions deployable on the edge, circumventing the substantial CO2 emissions associated with cloud servers and mitigating risks related to sharing sensitive data. However, deploying Convolutional Neural Networks (CNNs) on non-off-the-shelf edge devices remains a complex and labor-intensive task. In this paper, we present an end-to-end workflow for the deployment of CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator, which we modified for efficient implementation on FPGAs. We describe how we leverage open-source software at each optimization step of the deployment process, the customizations we added to it, and their impact on the final system's performance. We were able to achieve real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W. Our FPGA-based solution demonstrates superior power efficiency compared with other embedded hardware devices, and even outperforms other FPGA reference implementations. Finally, we show how this kind of solution can be integrated into a wider system by testing our proposed platform in a traffic monitoring scenario.
Abstract:The safe operation of automated vehicles depends on their ability to perceive the environment comprehensively. However, occlusion, sensor range, and environmental factors limit their perception capabilities. To overcome these limitations, collective perception enables vehicles to exchange information. However, fusing this exchanged information is a challenging task. Early fusion approaches require large amounts of bandwidth, while intermediate fusion approaches face interchangeability issues. Late fusion of shared detections is currently the only feasible approach. However, it often results in inferior performance due to information loss. To address this issue, we propose MR3D-Net, a dynamic multi-resolution 3D sparse voxel grid fusion backbone architecture for LiDAR-based collective perception. We show that sparse voxel grids at varying resolutions provide a meaningful and compact environment representation that can adapt to the communication bandwidth. MR3D-Net achieves state-of-the-art performance on the OPV2V 3D object detection benchmark while reducing the required bandwidth by up to 94% compared to early fusion. Code is available at https://github.com/ekut-es/MR3D-Net
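The idea that voxel grids at different resolutions trade detail for compactness can be illustrated with a small sketch. This is not the MR3D-Net architecture; it only shows, under the assumption that the transmitted data grows with the number of occupied voxels, how a coarser voxel size yields fewer voxels to share. The function name `occupied_voxels` and the toy point cloud are hypothetical.

```python
import math


def occupied_voxels(points, voxel_size):
    """Return the set of integer voxel indices occupied by (x, y, z) points."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    return {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }


# Hypothetical toy point cloud in metres (illustrative only).
cloud = [
    (0.1, 0.1, 0.1),
    (0.3, 0.1, 0.1),
    (0.9, 0.1, 0.1),
    (1.7, 0.2, 0.1),
    (1.9, 0.2, 0.1),
]

# Coarser voxels merge neighbouring points, so fewer voxels need to be sent.
for size in (0.25, 0.5, 1.0, 2.0):
    print(f"voxel size {size:4.2f} m -> {len(occupied_voxels(cloud, size))} occupied voxels")
```

Storing only occupied voxels is what makes the grid sparse: empty space costs nothing, and a sender with little bandwidth could choose a larger voxel size at the price of spatial detail.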
Abstract:Collective perception has received considerable attention as a promising approach to overcome occlusions and limited sensing ranges of vehicle-local perception in autonomous driving. In order to develop and test novel collective perception technologies, appropriate datasets are required. These datasets must include not only different environmental conditions, as they strongly influence the perception capabilities, but also a wide range of scenarios with different road users as well as realistic sensor models. Therefore, we propose the Synthetic COllective PErception (SCOPE) dataset. SCOPE is the first synthetic multi-modal dataset that incorporates realistic camera and LiDAR models as well as parameterized and physically accurate weather simulations for both sensor types. The dataset contains 17,600 frames from over 40 diverse scenarios with up to 24 collaborative agents, infrastructure sensors, and passive traffic, including cyclists and pedestrians. In addition, recordings from two novel digital-twin maps from Karlsruhe and Tübingen are included. The dataset is available at https://ekut-es.github.io/scope
Abstract:Comprehensive perception of the vehicle's environment and correct interpretation of the environment are crucial for the safe operation of autonomous vehicles. The perception of surrounding objects is the main component for further tasks such as trajectory planning. However, safe trajectory planning requires not only object detection, but also the detection of drivable areas and lane corridors. While initial approaches consider an advanced safety evaluation of object detection, the evaluation of lane detection still lacks sufficient safety metrics. Similar to the safety metrics for object detection, additional factors such as the scene semantics (road type and road width), the detection range, and the potential causes of missing detections, incorporated via vehicle speed, should be considered when evaluating lane detection. Therefore, we propose the Lane Safety Metric (LSM), which takes these factors into account and enables the safety evaluation of lane detection systems by determining an easily interpretable safety score. We evaluate our offline safety metric on various virtual scenarios using different lane detection approaches and compare it with state-of-the-art performance metrics.
Abstract:Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
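For readers unfamiliar with the baseline objective, the following is a minimal sketch of the standard InfoNCE loss for a single anchor, using dot-product similarity and a temperature. It does not implement AnInfoNCE, whose exact form is not given in the abstract above; the function name `info_nce`, the default temperature, and the toy vectors are assumptions chosen for illustration.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one anchor, one positive, and a list of negatives."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    def similarity(a, b):
        # Dot-product similarity scaled by the temperature.
        return sum(x * y for x, y in zip(a, b)) / temperature

    # The positive logit comes first, followed by one logit per negative.
    logits = [similarity(anchor, positive)]
    logits += [similarity(anchor, n) for n in negatives]

    # Numerically stable log-sum-exp over all logits (positive included).
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))

    # -log( exp(s_pos) / sum_j exp(s_j) )
    return log_denominator - logits[0]


# Hypothetical unit-norm embeddings (illustrative only).
anchor = (1.0, 0.0)
negatives = [(0.0, 1.0), (-1.0, 0.0)]
print("aligned positive:   ", info_nce(anchor, (1.0, 0.0), negatives))
print("misaligned positive:", info_nce(anchor, (0.0, 1.0), negatives))
```

The loss is small when the positive is much more similar to the anchor than any negative, and grows when the positive is indistinguishable from the negatives. This sketch treats every embedding dimension equally, which corresponds to the isotropic setting the abstract argues is unrealistic.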
Abstract:Epilepsy is the most common chronic neurological disease worldwide and is typically accompanied by recurring seizures. Neural implants can be used for effective treatment by suppressing an upcoming seizure upon detection. Due to the restricted size and limited battery lifetime of these medical devices, the employed approach also needs to be limited in size and have low energy requirements. We present an energy-efficient seizure detection approach involving a TC-ResNet and time-series analysis, which is suitable for low-power edge devices. The presented approach allows for accurate seizure detection without preceding feature extraction while considering the stringent hardware requirements of neural implants. The approach is validated using the CHB-MIT Scalp EEG Database with a 32-bit floating-point model and a hardware-suitable 4-bit fixed-point model. The presented method achieves an accuracy of 95.28%, a sensitivity of 92.34%, and an AUC score of 0.9384 on this dataset with 4-bit fixed-point representation. Furthermore, the power consumption of the model is measured on the low-power AI accelerator UltraTrail, which requires only 495 nW on average. Due to this low power consumption, this classification approach is suitable for real-time seizure detection on low-power wearable devices such as neural implants.
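To give a sense of how coarse a 4-bit fixed-point representation is, here is a minimal sketch of generic signed fixed-point quantization with saturation. It is not the quantization scheme of the paper, which the abstract above does not describe; the function name `quantize_fixed_point` and the split into 2 fractional bits are assumptions made purely for illustration.

```python
def quantize_fixed_point(value, total_bits=4, frac_bits=2):
    """Round a float to a signed fixed-point value, saturating at the range limits.

    total_bits and frac_bits are illustrative defaults, not values from the paper.
    """
    if total_bits < 1 or not 0 <= frac_bits < total_bits:
        raise ValueError("need total_bits >= 1 and 0 <= frac_bits < total_bits")
    scale = 1 << frac_bits                    # 2**frac_bits steps per unit
    code_min = -(1 << (total_bits - 1))       # -8 for 4 bits
    code_max = (1 << (total_bits - 1)) - 1    # +7 for 4 bits
    code = round(value * scale)               # Python rounds halves to even
    code = max(code_min, min(code_max, code)) # saturate instead of wrapping
    return code / scale


# Hypothetical weights (illustrative only).
for weight in (0.3, 0.6, -0.9, 5.0, -3.0):
    print(f"{weight:5.2f} -> {quantize_fixed_point(weight):5.2f}")
```

With these assumed settings only 16 distinct values exist, from -2.0 to 1.75 in steps of 0.25, which is why low-bit models are typically validated against a floating-point baseline as the abstract describes.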
Abstract:Statistical models are widely used to estimate the performance of commercial off-the-shelf (COTS) AI hardware accelerators. However, training statistical performance models often requires vast amounts of data, which leads to a significant time investment and can be difficult when hardware availability is limited. To alleviate this problem, we propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy. Our approach leverages knowledge of the target hardware architecture and initial parameter sweeps to identify a set of Performance Representatives (PR) for deep neural network (DNN) layers. These PRs are then used for benchmarking, building a statistical performance model, and making estimations. Compared to random sampling, this targeted approach drastically reduces the number of training samples needed to achieve a better estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-DNN estimations with fewer than 10,000 training samples. The results demonstrate the superiority of our method for single-layer estimations compared to models trained with randomly sampled datasets of the same size.
Abstract:To ensure safe operation of autonomous vehicles in complex urban environments, complete perception of the environment is necessary. However, due to environmental conditions, sensor limitations, and occlusions, this is not always possible from a single point of view. To address this issue, collective perception is an effective method. Realistic and large-scale datasets are essential for training and evaluating collective perception methods. This paper provides the first comprehensive technical review of collective perception datasets in the context of autonomous driving. The survey analyzes existing V2V and V2X datasets, categorizing them based on different criteria such as sensor modalities, environmental conditions, and scenario variety. The focus is on their applicability for the development of connected automated vehicles. This study aims to identify the key criteria of all datasets and to present their strengths, weaknesses, and anomalies. Finally, this survey concludes by making recommendations regarding which dataset is most suitable for collective 3D object detection, tracking, and semantic segmentation.