Abstract:OpenStreetMap (OSM), an online and versatile source of volunteered geographic information (VGI), is widely used for human self-localization by matching nearby visual observations with vectorized map data. However, due to the divergence in modalities and views, image-to-OSM (I2O) matching and localization remain challenging for robots, preventing the full utilization of VGI data in the unmanned ground vehicles and logistic industry. Inspired by the fact that the human brain relies on geometric and semantic understanding of sensory information for spatial localization tasks, we propose the OSMLoc in this paper. OSMLoc is a brain-inspired single-image visual localization method with semantic and geometric guidance to improve accuracy, robustness, and generalization ability. First, we equip the OSMLoc with the visual foundational model to extract powerful image features. Second, a geometry-guided depth distribution adapter is proposed to bridge the monocular depth estimation and camera-to-BEV transform. Thirdly, the semantic embeddings from the OSM data are utilized as auxiliary guidance for image-to-OSM feature matching. To validate the proposed OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation. Experiments on the MGL dataset, CC validation benchmark, and KITTI dataset have demonstrated the superiority of our method. Code, pre-trained models, CC validation benchmark, and additional results are available on: https://github.com/WHU-USI3DV/OSMLoc
Abstract:Wearable laser scanning (WLS) system has the advantages of flexibility and portability. It can be used for determining the user's path within a prior map, which is a huge demand for applications in pedestrian navigation, collaborative mapping, augmented reality, and emergency rescue. However, existing LiDAR-based global localization methods suffer from insufficient robustness, especially in complex large-scale outdoor scenes with insufficient features and incomplete coverage of the prior map. To address such challenges, we propose LiDAR-based reliable global localization (Reliable-loc) exploiting the verifiable cues in the sequential LiDAR data. First, we propose a Monte Carlo Localization (MCL) based on spatially verifiable cues, utilizing the rich information embedded in local features to adjust the particles' weights hence avoiding the particles converging to erroneous regions. Second, we propose a localization status monitoring mechanism guided by the sequential pose uncertainties and adaptively switching the localization mode using the temporal verifiable cues to avoid the crash of the localization system. To validate the proposed Reliable-loc, comprehensive experiments have been conducted on a large-scale heterogeneous point cloud dataset consisting of high-precision vehicle-mounted mobile laser scanning (MLS) point clouds and helmet-mounted WLS point clouds, which cover various street scenes with a length of over 20km. The experimental results indicate that Reliable-loc exhibits high robustness, accuracy, and efficiency in large-scale, complex street scenes, with a position accuracy of 1.66m, yaw accuracy of 3.09 degrees, and achieves real-time performance. For the code and detailed experimental results, please refer to https://github.com/zouxianghong/Reliable-loc.
Abstract:One of the major bottlenecks for efficient deployment of neural network based recommendation systems is the memory footprint of their embedding tables. Although many neural network based recommendation systems could benefit from the faster on-chip memory access and increased computational power of hardware accelerators, the large embedding tables in these models often cannot fit on the constrained memory of accelerators. Despite the pervasiveness of these models, prior methods in memory optimization and parallelism fail to address the memory and communication costs of large embedding tables on accelerators. As a result, the majority of models are trained on CPUs, while current implementations of accelerators are hindered by issues such as bottlenecks in inter-device communication and main memory lookups. In this paper, we propose a theoretical framework that analyses the communication costs of arbitrary distributed systems that use lookup tables. We use this framework to propose algorithms that maximize throughput subject to memory, computation, and communication constraints. Furthermore, we demonstrate that our method achieves strong theoretical performance across dataset distributions and memory constraints, applicable to a wide range of use cases from mobile federated learning to warehouse-scale computation. We implement our framework and algorithms in PyTorch and achieve up to 6x increases in training throughput on GPU systems over baselines, on the Criteo Terabytes dataset.
Abstract:Large-scale recommendation models are currently the dominant workload for many large Internet companies. These recommenders are characterized by massive embedding tables that are sparsely accessed by the index for user and item features. The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models. In this work, we propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM). The proposed framework makes inference more efficient on the cloud servers, explores the possibility of deploying powerful recommenders on smaller edge devices, and optimizes the workload of the communication overhead in distributed training under the data parallelism settings. Specifically, we show that quantization-aware training (QAT) can impose a strong regularization effect to mitigate the severe overfitting issues suffered by DLRMs. Consequently, we achieved INT4 quantization of DLRM models without any accuracy drop. We further propose two techniques that improve and accelerate the conventional QAT workload specifically for the embedding tables in the recommendation models. Furthermore, to achieve efficient training, we quantize the gradients of the embedding tables into INT8 on top of the well-supported specified sparsification. We show that combining gradient sparsification and quantization together significantly reduces the amount of communication. Briefly, DQRM models with INT4 can achieve 79.07% accuracy on Kaggle with 0.27 GB model size, and 81.21% accuracy on the Terabyte dataset with 1.57 GB, which even outperform FP32 DLRMs that have much larger model sizes (2.16 GB on Kaggle and 12.58 on Terabyte).
Abstract:In this paper, we propose VistaDream a novel framework to reconstruct a 3D scene from a single-view image. Recent diffusion models enable generating high-quality novel-view images from a single-view input image. Most existing methods only concentrate on building the consistency between the input image and the generated images while losing the consistency between the generated images. VistaDream addresses this problem by a two-stage pipeline. In the first stage, VistaDream begins with building a global coarse 3D scaffold by zooming out a little step with inpainted boundaries and an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images to inpaint the holes of the scaffold. In the second stage, we further enhance the consistency between the generated novel-view images by a novel training-free Multiview Consistency Sampling (MCS) that introduces multi-view consistency constraints in the reverse sampling process of diffusion models. Experimental results demonstrate that without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis using just single-view images and outperforms baseline methods by a large margin. The code, videos, and interactive demos are available at https://vistadream-project-page.github.io/.
Abstract:Diffusion models, such as Stable Diffusion, have made significant strides in visual generation, yet their paradigm remains fundamentally different from autoregressive language models, complicating the development of unified language-vision models. Recent efforts like LlamaGen have attempted autoregressive image generation using discrete VQVAE tokens, but the large number of tokens involved renders this approach inefficient and slow. In this work, we present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. By incorporating a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions, Meissonic substantially improves MIM's performance and efficiency. Additionally, we leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images. Extensive experiments validate Meissonic's capabilities, demonstrating its potential as a new standard in text-to-image synthesis. We release a model checkpoint capable of producing $1024 \times 1024$ resolution images.
Abstract:Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step for utilizing these dashcam data involves the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blurs and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images, leveraging the inherent camera motion prior. Typically, image sequences captured by dash cameras exhibit pronounced motion prior, such as forward movement or lateral turns, which serve as essential cues for correspondence estimation. Building upon this observation, we devise a pose regression module aimed at learning camera motion prior, subsequently integrating these prior into both correspondences and pose estimation processes. The experiment shows that, in real dashcams dataset, our method is 22% better than the baseline for pose estimation in AUC5\textdegree, and it can estimate poses for 19% more images with less reprojection error in Structure from Motion (SfM).
Abstract:The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platform, which gathers user votes on model comparisons, can rank models with human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used ELO algorithm. To further validate the superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
Abstract:Corresponding author}In this paper, we explore a novel framework, EGIInet (Explicitly Guided Information Interaction Network), a model for View-guided Point cloud Completion (ViPC) task, which aims to restore a complete point cloud from a partial one with a single view image. In comparison with previous methods that relied on the global semantics of input images, EGIInet efficiently combines the information from two modalities by leveraging the geometric nature of the completion task. Specifically, we propose an explicitly guided information interaction strategy supported by modal alignment for point cloud completion. First, in contrast to previous methods which simply use 2D and 3D backbones to encode features respectively, we unified the encoding process to promote modal alignment. Second, we propose a novel explicitly guided information interaction strategy that could help the network identify critical information within images, thus achieving better guidance for completion. Extensive experiments demonstrate the effectiveness of our framework, and we achieved a new state-of-the-art (+16\% CD over XMFnet) in benchmark datasets despite using fewer parameters than the previous methods. The pre-trained model and code and are available at https://github.com/WHU-USI3DV/EGIInet.
Abstract:The impact of quantization on the overall performance of deep learning models is a well-studied problem. However, understanding and mitigating its effects on a more fine-grained level is still lacking, especially for harder tasks such as object detection with both classification and regression objectives. This work defines the performance for a subset of task-critical categories, i.e. the critical-category performance, as a crucial yet largely overlooked fine-grained objective for detection tasks. We analyze the impact of quantization at the category-level granularity, and propose methods to improve performance for the critical categories. Specifically, we find that certain critical categories have a higher sensitivity to quantization, and are prone to overfitting after quantization-aware training (QAT). To explain this, we provide theoretical and empirical links between their performance gaps and the corresponding loss landscapes with the Fisher information framework. Using this evidence, we apply a Fisher-aware mixed-precision quantization scheme, and a Fisher-trace regularization for the QAT on the critical-category loss landscape. The proposed methods improve critical-category metrics of the quantized transformer-based DETR detectors. They are even more significant in case of larger models and higher number of classes where the overfitting becomes more severe. For example, our methods lead to 10.4% and 14.5% mAP gains for, correspondingly, 4-bit DETR-R50 and Deformable DETR on the most impacted critical classes in the COCO Panoptic dataset.