Abstract: Implicit Neural Representations (INRs) and Neural Fields are a novel paradigm for signal representation, from images and audio to 3D scenes and videos. The fundamental idea is to represent a signal as a continuous and differentiable neural network. This offers unprecedented benefits such as continuous resolution and memory efficiency, enabling new compression techniques. However, representing data as neural networks poses new challenges. For instance, given a 2D image stored as a neural network, how can we further compress such a neural image? In this work, we present a novel analysis of compressing neural fields, with a focus on images. We also introduce Adaptive Neural Images (ANI), an efficient neural representation that enables adaptation to different inference or transmission requirements. Our proposed method reduces the bits-per-pixel (bpp) of the neural image by 4x without losing fine details or harming fidelity. We achieve this through a successful implementation of 4-bit neural representations. Our work offers a new framework for developing compressed neural fields.
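The ANI method itself is not spelled out in the abstract, but its core ingredient, a 4-bit neural representation, can be illustrated. Below is a minimal, hypothetical sketch of uniform symmetric 4-bit weight quantization applied to a small SIREN-style coordinate MLP (a "neural image" mapping (x, y) to RGB); the layer sizes, per-channel scheme, and sine frequency are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class Sine(nn.Module):
    """Sine activation used in SIREN-style coordinate MLPs."""
    def forward(self, x):
        return torch.sin(30.0 * x)

def quantize_4bit(w):
    """Uniform symmetric 4-bit quantization, one scale per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.round(w / scale).clamp(-8, 7)      # int4 range [-8, 7]
    return q.to(torch.int8), scale               # int8 used as a container

# A tiny neural image: (x, y) coordinates -> (r, g, b).
mlp = nn.Sequential(nn.Linear(2, 64), Sine(),
                    nn.Linear(64, 64), Sine(),
                    nn.Linear(64, 3))

# Replace every linear layer's weights with their 4-bit reconstruction.
for layer in mlp:
    if isinstance(layer, nn.Linear):
        q, scale = quantize_4bit(layer.weight.data)
        layer.weight.data = q.float() * scale    # simulate 4-bit storage

coords = torch.rand(1024, 2) * 2 - 1             # normalized pixel coordinates
rgb = mlp(coords)                                # decode at any resolution
```

Storing the int4 codes plus per-channel scales, rather than float32 weights, is what drives down the bits-per-pixel of the representation.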
Abstract: Compressing large and performant vision foundation models (VFMs) to meet arbitrary bit-wise-operation (BitOPs) budgets allows their deployment on various hardware. We propose fine-tuning a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) can be adopted for this purpose: a supernet is trained once, and then subnets satisfying arbitrary hardware budgets can be extracted from it. However, existing methods struggle to optimize the mixed-precision search space and incur large memory costs during training. To tackle these challenges, we first study effective search-space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-widths) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated on the recently proposed VFM, the Segment Anything Model, fine-tuned on segmentation tasks. The searched model yields about a 95% reduction in BitOPs without incurring performance degradation.
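As a minimal sketch of the two ideas this abstract combines, frozen base weights with a trainable low-rank adapter (LoRA), and a mixed-precision supernet where a weight bit-width is sampled per forward pass, consider the hypothetical layer below; the rank, bit choices, and straight-through estimator are illustrative assumptions, not the paper's exact configuration.

```python
import random
import torch
import torch.nn as nn

class LoRAQuantLinear(nn.Module):
    """Frozen base weight + trainable low-rank adapter, with a weight
    bit-width sampled per forward pass (one subnet of the supernet)."""
    def __init__(self, in_f, out_f, rank=4, bit_choices=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02,
                                   requires_grad=False)    # frozen VFM weight
        self.lora_a = nn.Parameter(torch.zeros(rank, in_f))
        self.lora_b = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        self.bit_choices = bit_choices

    def fake_quant(self, w, bits):
        """Uniform fake quantization with a straight-through estimator."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
        return w + (w_q - w).detach()            # gradients pass through

    def forward(self, x, bits=None):
        bits = bits if bits is not None else random.choice(self.bit_choices)
        w = self.weight + self.lora_b @ self.lora_a   # merge adapter
        return x @ self.fake_quant(w, bits).t()

layer = LoRAQuantLinear(128, 128)
loss = layer(torch.randn(4, 128)).sum()
loss.backward()                                  # only LoRA factors get grads
```

Sampling a bit-width per step trains all precisions jointly, so a subnet matching a given BitOPs budget can be extracted afterwards; memory stays low because only the small LoRA factors carry gradients and optimizer state.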
Abstract: We propose a lightweight and highly efficient joint detection and tracking pipeline for multi-object tracking using a fully transformer-based architecture. It is a modified version of TransTrack that overcomes the computational bottleneck associated with its design while achieving a state-of-the-art MOTA score of 73.20%. The model is driven by a transformer-based backbone instead of a CNN, which scales well with the input resolution. We also propose a drop-in replacement for the feed-forward network of the transformer encoder layer, using the Butterfly Transform operation to perform channel fusion and depth-wise convolution to learn spatial context within the feature maps, which is otherwise missing from the attention maps of the transformer. As a result of our modifications, we reduce the overall model size of TransTrack by 58.73% and its complexity by 78.72%. We therefore expect our design to provide novel perspectives for architecture optimization in future multi-object tracking research.
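The FFN replacement described here can be sketched as a butterfly-style channel fusion in O(C log C) instead of the dense O(C^2) projection, followed by a depth-wise 3x3 convolution over the token grid for spatial context. This is a simplified, hypothetical rendering; the exact Butterfly Transform parameterization and its placement in TransTrack's encoder are not specified in the abstract.

```python
import math
import torch
import torch.nn as nn

class ButterflyMix(nn.Module):
    """Butterfly-style channel fusion: log2(C) stages, each mixing every
    channel i with its partner i XOR 2^s (O(C log C) parameters)."""
    def __init__(self, channels):
        super().__init__()
        assert channels & (channels - 1) == 0, "channels must be a power of 2"
        self.stages = int(math.log2(channels))
        self.w = nn.Parameter(torch.randn(self.stages, channels, 2) * 0.5)

    def forward(self, x):                        # x: (B, N, C)
        C = x.shape[-1]
        idx = torch.arange(C, device=x.device)
        for s in range(self.stages):
            partner = idx ^ (1 << s)             # pair channels across stride 2^s
            x = self.w[s, :, 0] * x + self.w[s, :, 1] * x[..., partner]
        return x

class ButterflyFFN(nn.Module):
    """Drop-in FFN replacement: butterfly channel fusion + depth-wise
    convolution for the spatial context that attention maps alone miss."""
    def __init__(self, channels):
        super().__init__()
        self.mix = ButterflyMix(channels)
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.act = nn.GELU()

    def forward(self, x, h, w):                  # x: (B, h*w, C) tokens
        B, N, C = x.shape
        x = self.act(self.mix(x))                # channel fusion
        x = x.transpose(1, 2).reshape(B, C, h, w)
        x = self.dw(x)                           # local spatial context
        return x.flatten(2).transpose(1, 2)

ffn = ButterflyFFN(64)
y = ffn(torch.randn(2, 16 * 16, 64), 16, 16)    # same shape in and out
```

Replacing the two dense C-to-4C-to-C projections of a standard FFN with such a factorization is where the bulk of the parameter and FLOP savings in this kind of design comes from.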
Abstract: The image signal processor (ISP) plays an important role not only for human perceptual quality but also for computer vision. In most cases, experts resort to manually tuning the many parameters of an ISP for perceptual quality, which results in sub-optimal performance, especially for computer vision. To improve ISPs, two approaches have been actively proposed: tuning the parameters with machine learning, or constructing the ISP from a DNN. The former is lightweight but lacks expressive power; the latter is expressive but too computationally heavy for edge devices. To this end, we propose DynamicISP, which consists of traditional simple ISP functions whose parameters are controlled dynamically per image, according to the feedback of the downstream image recognition model on the previous frame. Our method successfully controls the parameters of multiple ISP functions and achieves state-of-the-art accuracy at a small computational cost.
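A minimal sketch of the control loop described here might look like the following: a tiny controller reads features from the recognition model on the previous frame and emits parameters for classical, differentiable ISP functions. The two functions below (gain and a gamma tone curve) and their parameter ranges are placeholder choices; the paper's actual function set is not given in the abstract.

```python
import torch
import torch.nn as nn

class DynamicISP(nn.Module):
    """Classical ISP functions whose parameters are predicted per image
    from the downstream recognizer's features on the previous frame."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.controller = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, raw, prev_feat):
        # raw: (B, 3, H, W) linear image; prev_feat: (B, feat_dim)
        p = torch.sigmoid(self.controller(prev_feat))      # in (0, 1)
        gain = (0.5 + 4.0 * p[:, 0]).view(-1, 1, 1, 1)     # exposure gain
        gamma = (0.5 + 2.0 * p[:, 1]).view(-1, 1, 1, 1)    # tone curve
        x = (raw * gain).clamp(1e-6, 1.0)
        return x.pow(1.0 / gamma)

isp = DynamicISP()
frame = torch.rand(2, 3, 64, 64)          # linear RAW-like input
prev_feat = torch.randn(2, 128)           # recognizer features, frame t-1
rgb = isp(frame, prev_feat)               # fed to the recognizer at frame t
```

Because the ISP stays a handful of closed-form functions, the per-frame overhead is only the small controller network, which is what keeps the computational cost low.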
Abstract: Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high-dynamic-range conditions) would be highly useful. However, creating a training dataset for such environments is expensive and hard due to the difficulties of data collection and annotation; it is desirable to obtain a robust model without such hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images of simple scenes. Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution, because it ignores the non-linearity of the image signal processor (ISP) and the noise characteristics of the image sensor. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur are applied to a RAW image before the non-linear ISP, yielding realistic intensities. Furthermore, we introduce a noise-amount alignment method that calibrates the domain gap in noise properties caused by the augmentation. We show that our noise-accounted RAW augmentation doubles image recognition accuracy in challenging environments using only simple training data.
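The key calculation, aligning noise after a RAW-domain brightness change, can be sketched under a standard affine sensor-noise model var(x) = a·x + b (shot plus read noise). The constants a and b below are placeholders for a real sensor calibration, and the single gain augmentation stands in for the paper's full color jitter and blur.

```python
import torch

def noise_accounted_gain(raw, gain, a=0.01, b=1e-4):
    """Brightness augmentation on a RAW image with noise alignment.
    Sensor noise model: var(x) = a * x + b.  After scaling by `gain`,
    the carried-over noise has variance gain^2 * (a*raw + b); Gaussian
    noise is added to reach the variance a * (gain*raw) + b that a real
    capture at the new signal level would show."""
    target_var = a * gain * raw + b
    current_var = gain ** 2 * (a * raw + b)
    extra = (target_var - current_var).clamp(min=0.0)
    return gain * raw + torch.randn_like(raw) * extra.sqrt()

def simple_isp(raw, gamma=2.2):
    """Non-linear ISP applied *after* RAW-domain augmentation."""
    return raw.clamp(0, 1).pow(1.0 / gamma)

raw = torch.rand(1, 3, 64, 64) * 0.5          # linear RAW-like signal
aug = noise_accounted_gain(raw, gain=0.3)     # darken with aligned noise
srgb = simple_isp(aug)                        # realistic dark training image
```

Darkening (gain < 1) leaves the carried-over noise below what a real dark capture would show, so the missing variance is added; for gain > 1 the clamp simply adds nothing, since noise cannot be subtracted this way.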