Abstract:In recent years, 3D Gaussian splatting (3D-GS) has emerged as a novel scene representation approach. However, existing vision-only 3D-GS methods often rely on hand-crafted heuristics for point-cloud densification and face challenges in handling occlusions and high GPU memory and computation consumption. LiDAR-Inertial-Visual (LIV) sensor configuration has demonstrated superior performance in localization and dense mapping by leveraging complementary sensing characteristics: rich texture information from cameras, precise geometric measurements from LiDAR, and high-frequency motion data from IMU. Inspired by this, we propose a novel real-time Gaussian-based simultaneous localization and mapping (SLAM) system. Our map system comprises a global Gaussian map and a sliding window of Gaussians, along with an IESKF-based odometry. The global Gaussian map consists of hash-indexed voxels organized in a recursive octree, effectively covering sparse spatial volumes while adapting to different levels of detail and scales. The Gaussian map is initialized through multi-sensor fusion and optimized with photometric gradients. Our system incrementally maintains a sliding window of Gaussians, significantly reducing GPU computation and memory consumption by only optimizing the map within the sliding window. Moreover, we implement a tightly coupled multi-sensor fusion odometry with an iterative error state Kalman filter (IESKF), leveraging real-time updating and rendering of the Gaussian map. Our system represents the first real-time Gaussian-based SLAM framework deployable on resource-constrained embedded systems, demonstrated on the NVIDIA Jetson Orin NX platform. The framework achieves real-time performance while maintaining robust multi-sensor fusion capabilities. All implementation algorithms, hardware designs, and CAD models will be publicly available.
Abstract:Cross-modal localization has drawn increasing attention in recent years, while the visual relocalization in prior LiDAR maps is less studied. Related methods usually suffer from inconsistency between the 2D texture and 3D geometry, neglecting the intensity features in the LiDAR point cloud. In this paper, we propose a cross-modal visual relocalization system in prior LiDAR maps utilizing intensity textures, which consists of three main modules: map projection, coarse retrieval, and fine relocalization. In the map projection module, we construct the database of intensity channel map images leveraging the dense characteristic of panoramic projection. The coarse retrieval module retrieves the top-K most similar map images to the query image from the database, and retains the top-K' results by covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. The experimental results on our self-collected datasets demonstrate the effectiveness in both place recognition and pose estimation tasks.
Abstract:The advent of ultra-massive multiple-input-multiple output systems holds great promise for next-generation communications, yet their channels exhibit hybrid far- and near- field beam-squint (HFBS) effect. In this paper, we not only overcome but also harness the HFBS effect to propose an integrated location sensing and communication (ILSC) framework. During the uplink training stage, user terminals (UTs) transmit reference signals for simultaneous channel estimation and location sensing. This stage leverages an elaborately designed hybrid-field projection matrix to overcome the HFBS effect and estimate the channel in compressive manner. Subsequently, the scatterers' locations can be sensed from the spherical wavefront based on the channel estimation results. By treating the sensed scatterers as virtual anchors, we employ a weighted least-squares approach to derive UT' s location. Moreover, we propose an iterative refinement mechanism, which utilizes the accurately estimated time difference of arrival of multipath components to enhance location sensing precision. In the following downlink data transmission stage, we leverage the acquired location information to further optimize the hybrid beamformer, which combines the beam broadening and focusing to mitigate the spectral efficiency degradation resulted from the HFBS effect. Extensive simulation experiments demonstrate that the proposed ILSC scheme has superior location sensing and communication performance than conventional methods.
Abstract:Autonomous parking is a crucial task in the intelligent driving field. Traditional parking algorithms are usually implemented using rule-based schemes. However, these methods are less effective in complex parking scenarios due to the intricate design of the algorithms. In contrast, neural-network-based methods tend to be more intuitive and versatile than the rule-based methods. By collecting a large number of expert parking trajectory data and emulating human strategy via learning-based methods, the parking task can be effectively addressed. In this paper, we employ imitation learning to perform end-to-end planning from RGB images to path planning by imitating human driving trajectories. The proposed end-to-end approach utilizes a target query encoder to fuse images and target features, and a transformer-based decoder to autoregressively predict future waypoints. We conducted extensive experiments in real-world scenarios, and the results demonstrate that the proposed method achieved an average parking success rate of 87.8% across four different real-world garages. Real-vehicle experiments further validate the feasibility and effectiveness of the method proposed in this paper.
Abstract:Bird's-eye-view (BEV) representation is crucial for the perception function in autonomous driving tasks. It is difficult to balance the accuracy, efficiency and range of BEV representation. The existing works are restricted to a limited perception range within 50 meters. Extending the BEV representation range can greatly benefit downstream tasks such as topology reasoning, scene understanding, and planning by offering more comprehensive information and reaction time. The Standard-Definition (SD) navigation maps can provide a lightweight representation of road structure topology, characterized by ease of acquisition and low maintenance costs. An intuitive idea is to combine the close-range visual information from onboard cameras with the beyond line-of-sight (BLOS) environmental priors from SD maps to realize expanded perceptual capabilities. In this paper, we propose BLOS-BEV, a novel BEV segmentation model that incorporates SD maps for accurate beyond line-of-sight perception, up to 200m. Our approach is applicable to common BEV architectures and can achieve excellent results by incorporating information derived from SD maps. We explore various feature fusion schemes to effectively integrate the visual BEV representations and semantic features from the SD map, aiming to leverage the complementary information from both sources optimally. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in BEV segmentation on nuScenes and Argoverse benchmark. Through multi-modal inputs, BEV segmentation is significantly enhanced at close ranges below 50m, while also demonstrating superior performance in long-range scenarios, surpassing other methods by over 20% mIoU at distances ranging from 50-200m.
Abstract:Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD map is expensive and challenging to scale up. Given these limitations, leveraging navigation maps has emerged as a promising low-cost alternative for localization. Current approaches based on navigation maps can achieve highly accurate localization, but their complex matching strategies lead to unacceptable inference latency that fails to meet the real-time demands. To address these limitations, we propose a novel transformer-based neural re-localization method. Inspired by image registration, our approach performs a coarse-to-fine neural feature registration between navigation map and visual bird's-eye view features. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets, which is nearly 10%/20% localization accuracy and 30/16 FPS improvement on single-view and surround-view input settings, separately. We highlight that our research presents an HD-map-free localization method for autonomous driving, offering cost-effective, reliable, and scalable performance in challenging driving environments.
Abstract:Recently, Neural Radiance Fields (NeRF) achieved impressive results in novel view synthesis. Block-NeRF showed the capability of leveraging NeRF to build large city-scale models. For large-scale modeling, a mass of image data is necessary. Collecting images from specially designed data-collection vehicles can not support large-scale applications. How to acquire massive high-quality data remains an opening problem. Noting that the automotive industry has a huge amount of image data, crowd-sourcing is a convenient way for large-scale data collection. In this paper, we present a crowd-sourced framework, which utilizes substantial data captured by production vehicles to reconstruct the scene with the NeRF model. This approach solves the key problem of large-scale reconstruction, that is where the data comes from and how to use them. Firstly, the crowd-sourced massive data is filtered to remove redundancy and keep a balanced distribution in terms of time and space. Then a structure-from-motion module is performed to refine camera poses. Finally, images, as well as poses, are used to train the NeRF model in a certain block. We highlight that we present a comprehensive framework that integrates multiple modules, including data selection, sparse 3D reconstruction, sequence appearance embedding, depth supervision of ground surface, and occlusion completion. The complete system is capable of effectively processing and reconstructing high-quality 3D scenes from crowd-sourced data. Extensive quantitative and qualitative experiments were conducted to validate the performance of our system. Moreover, we proposed an application, named first-view navigation, which leveraged the NeRF model to generate 3D street view and guide the driver with a synthesized video.
Abstract:In massive multiple-input multiple-output (MIMO) systems, how to reliably acquire downlink channel state information (CSI) with low overhead is challenging. In this work, by integrating the generative pre-trained Transformer (GPT) with federated-tuning, we propose a CSI-GPT approach to realize efficient downlink CSI acquisition. Specifically, we first propose a Swin Transformer-based channel acquisition network (SWTCAN) to acquire downlink CSI, where pilot signals, downlink channel estimation, and uplink CSI feedback are jointly designed. Furthermore, to solve the problem of insufficient training data, we propose a variational auto-encoder-based channel sample generator (VAE-CSG), which can generate sufficient CSI samples based on a limited number of high-quality CSI data obtained from the current cell. The CSI dataset generated from VAE-CSG will be used for pre-training SWTCAN. To fine-tune the pre-trained SWTCAN for improved performance, we propose an online federated-tuning method, where only a small amount of SWTCAN parameters are unfrozen and updated using over-the-air computation, avoiding the high communication overhead caused by aggregating the complete CSI samples from user equipment (UEs) to the BS for centralized fine-tuning. Simulation results verify the advantages of the proposed SWTCAN and the communication efficiency of the proposed federated-tuning method.
Abstract:Reconfigurable massive multiple-input multiple-output (RmMIMO) technology offers increased flexibility for future communication systems by exploiting previously untapped degrees of freedom in the electromagnetic (EM) domain. The representation of the traditional spatial domain channel state information (sCSI) limits the insights into the potential of EM domain channel properties, constraining the base station's (BS) utmost capability for precoding design. This paper leverages the EM domain channel state information (eCSI) for radiation pattern design at the BS. We develop an orthogonal decomposition method based on spherical harmonic functions to decompose the radiation pattern into a linear combination of orthogonal bases. By formulating the radiation pattern design as an optimization problem for the projection coefficients over these bases, we develop a manifold optimization-based method for iterative radiation pattern and digital precoder design. To address the eCSI estimation problem, we capitalize on the inherent structure of the channel. Specifically, we propose a subspace-based scheme to reduce the pilot overhead for wideband sCSI estimation. Given the estimated full-band sCSI, we further employ parameterized methods for angle of arrival estimation. Subsequently, the complete eCSI can be reconstructed after estimating the equivalent channel gain via the least squares method. Simulation results demonstrate that, in comparison to traditional mMIMO systems with fixed antenna radiation patterns, the proposed RmMIMO architecture offers significant throughput gains for multi-user transmission at a low channel estimation overhead.
Abstract:This article presents a comprehensive study on the emerging near-space communications (NS-COM) within the context of space-air-ground-sea integrated network (SAGSIN). Specifically, we firstly explore the recent technical developments of NS-COM, followed by the discussions about motivations behind integrating NS-COM into SAGSIN. To further demonstrate the necessity of NS-COM, a comparative analysis between the NS-COM network and other counterparts in SAGSIN is conducted, covering aspects of deployment, coverage and channel characteristics. Afterwards, the technical aspects of NS-COM, including channel modeling, random access, channel estimation, array-based beam management and joint network optimization, are examined in detail. Furthermore, we explore the potential applications of NS-COM, such as structural expansion in SAGSIN communications, remote and urgent communications, weather monitoring and carbon neutrality. Finally, some promising research avenues are identified, including near-space-ground direct links, reconfigurable multiple input multiple output (MIMO) array, federated learning assisted NS-COM, maritime communication and free space optical (FSO) communication. Overall, this paper highlights that the NS-COM plays an indispensable role in the SAGSIN puzzle, providing substantial performance and coverage enhancement to the traditional SAGSIN architecture.