Abstract:3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields(NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms.This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks.To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository, accessing the repository at: https://github.com/InternLandMark/LandMarkSystem.
Abstract:Rendering large-scale 3D Gaussian Splatting (3DGS) model faces significant challenges in achieving real-time, high-fidelity performance on consumer-grade devices. Fully realizing the potential of 3DGS in applications such as virtual reality (VR) requires addressing critical system-level challenges to support real-time, immersive experiences. We propose GS-Cache, an end-to-end framework that seamlessly integrates 3DGS's advanced representation with a highly optimized rendering system. GS-Cache introduces a cache-centric pipeline to eliminate redundant computations, an efficiency-aware scheduler for elastic multi-GPU rendering, and optimized CUDA kernels to overcome computational bottlenecks. This synergy between 3DGS and system design enables GS-Cache to achieve up to 5.35x performance improvement, 35% latency reduction, and 42% lower GPU memory usage, supporting 2K binocular rendering at over 120 FPS with high visual quality. By bridging the gap between 3DGS's representation power and the demands of VR systems, GS-Cache establishes a scalable and efficient framework for real-time neural rendering in immersive environments.
Abstract:The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown potential, leveraging FP4 remains challenging due to inherent quantization errors and limited representation capability. Based on the Transformer architecture, we present an FP4 training scheme for LLMs, overcoming these obstacles through mixed-precision quantization strategies tailed for different modules and training stages. This allows us to apply the precision level suitable to distinct components within the model, ensuring that multi-head attention and linear layers are handled appropriately. Our pretraining recipe ensures stability in backpropagation by incorporating fine-grained quantization methods with a target precision training schedule. Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.
Abstract:Recently, 3D Gaussian Splatting (3DGS) has garnered attention for its high fidelity and real-time rendering. However, adapting 3DGS to different camera models, particularly fisheye lenses, poses challenges due to the unique 3D to 2D projection calculation. Additionally, there are inefficiencies in the tile-based splatting, especially for the extreme curvature and wide field of view of fisheye lenses, which are crucial for its broader real-life applications. To tackle these challenges, we introduce Fisheye-GS.This innovative method recalculates the projection transformation and its gradients for fisheye cameras. Our approach can be seamlessly integrated as a module into other efficient 3D rendering methods, emphasizing its extensibility, lightweight nature, and modular design. Since we only modified the projection component, it can also be easily adapted for use with different camera models. Compared to methods that train after undistortion, our approach demonstrates a clear improvement in visual quality.
Abstract:This work introduces FlashGS, an open-source CUDA Python library, designed to facilitate the efficient differentiable rasterization of 3D Gaussian Splatting through algorithmic and kernel-level optimizations. FlashGS is developed based on the observations from a comprehensive analysis of the rendering process to enhance computational efficiency and bring the technique to wide adoption. The paper includes a suite of optimization strategies, encompassing redundancy elimination, efficient pipelining, refined control and scheduling mechanisms, and memory access optimizations, all of which are meticulously integrated to amplify the performance of the rasterization process. An extensive evaluation of FlashGS' performance has been conducted across a diverse spectrum of synthetic and real-world large-scale scenes, encompassing a variety of image resolutions. The empirical findings demonstrate that FlashGS consistently achieves an average 4x acceleration over mobile consumer GPUs, coupled with reduced memory consumption. These results underscore the superior performance and resource optimization capabilities of FlashGS, positioning it as a formidable tool in the domain of 3D rendering.
Abstract:With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory complexity. Nevertheless, the existing training framework of Mamba presents inefficiency with variable-length sequence inputs. Either single-sequence training results in low GPU utilization, or batched processing of variable-length sequences to a maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and proposed PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences. Diving deep into state-space models (SSMs), we modify the parallel operators to avoid passing information between individual sequences while maintaining high performance. Experimental results on an NVIDIA A100 GPU demonstrate throughput exceeding the baseline single-sequence processing scheme: 3.06x speedup on the 1.4B model and 2.62x on the 2.8B model.
Abstract:This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration, which is crucial for many downstream vision tasks to fully leverage the complementary information present in visible and infrared images. In this field, researchers face two primary challenges - the absence of a correctly-annotated benchmark with viewpoint variations for evaluating RGB-NIR cross-modality registration methods and the problem of inconsistent local features caused by the appearance discrepancy between RGB-NIR cross-modality images. To address these challenges, we first present the RGB-NIR Image Registration (RGB-NIR-IRegis) benchmark, which, for the first time, enables fair and comprehensive evaluations for the task of RGB-NIR cross-modality image registration. Evaluations of previous methods highlight the significant challenges posed by our RGB-NIR-IRegis benchmark, especially on RGB-NIR image pairs with viewpoint variations. To analyze the causes of the unsatisfying performance, we then design several metrics to reveal the toxic impact of inconsistent local features between visible and infrared images on the model performance. This further motivates us to develop a baseline method named Semantic Guidance Transformer (SGFormer), which utilizes high-level semantic guidance to mitigate the negative impact of local inconsistent features. Despite the simplicity of our motivation, extensive experimental results show the effectiveness of our method.
Abstract:In this paper, we consider the recovery of the high-dimensional block-sparse signal from a compressed set of measurements, where the non-zero coefficients of the recovered signal occur in a small number of blocks. Adopting the idea of deep unfolding, we explore the block-sparse structure and put forward a block-sparse reconstruction network named Ada-BlockLISTA, which performs gradient descent on every single block followed by a block-wise shrinkage. Furthermore, we prove the linear convergence rate of our proposed network, which also theoretically guarantees exact recovery for a potentially higher sparsity level based on underlyingblock structure. Numerical results indicate that Ada-BlockLISTA yields better signal recovery performance compared with existing algorithms, which ignore the additional block structure in the signal model.
Abstract:As a typical signal processing problem, multidimensional harmonic retrieval (MHR) has been adapted to a wide range of applications in signal processing. Block-sparse signals, whose nonzero entries appearing in clusters, have received much attention recently. An unfolded network, named Ada-BlockLISTA, was proposed to recover a block-sparse signal at a small computational cost, which learns an individual weight matrix for each block. However, as the number of network parameters is increasingly associated with the number of blocks, the demand for parameter reduction becomes very significant, especially for large-scale MHR. Based on the dictionary characteristics in two-dimensional (2D) harmonic retrieve problems, we introduce a weight coupling structure to shrink Ada-BlockLISTA, which significantly reduces the number of weights without performance degradation. In simulations, our proposed block-sparse reconstruction network, named AdaBLISTA-CP, shows excellent recovery performance and convergence speed in 2D harmonic retrieval problems.
Abstract:Learned iterative shrinkage thresholding algorithm (LISTA), which adopts deep learning techniques to learn optimal algorithm parameters from labeled training data, can be successfully applied to small-scale multidimensional harmonic retrieval (MHR) problems. However, LISTA computationally demanding for large-scale MHR problems because the matrix size of the learned mutual inhibition matrix exhibits quadratic growth with the signal length. These large matrices consume costly memory/computation resources and require a huge amount of labeled data for training, restricting the applicability of the LISTA method. In this paper, we show that the mutual inhibition matrix of a MHR problem naturally has a Toeplitz structure, which means that the degrees of freedom (DoF) of the matrix can be reduced from a quadratic order to a linear order. By exploiting this characteristic, we propose a structured LISTA-Toeplitz network, which imposes a Toeplitz structure restriction on the mutual inhibition matrices and applies linear convolution instead of the matrix-vector multiplication involved in the traditional LISTA network. Both simulation and field test for air target detection with radar are carried out to validate the performance of the proposed network. For small-scale MHR problems, LISTAToeplitz exhibits close or even better recovery accuracy than traditional LISTA, while the former significantly reduces the network complexity and requires much less training data. For large-scale MHR problems, where LISTA is difficult to implement due to the huge size of the mutual inhibition matrices, our proposed LISTA-Toeplitz still enjoys desirable recovery performance.