Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jilin Mei

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Mar 29, 2025

Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang(+3 more)

Abstract:Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.

* Project page: https://fudan-zvg.github.io/spar

Via

Access Paper or Ask Questions

MASTER: Multimodal Segmentation with Text Prompts

Mar 06, 2025

Fuyang Liu, Shun Lu, Jilin Mei, Yu Hu

Abstract:RGB-Thermal fusion is a potential solution for various weather and light conditions in challenging scenarios. However, plenty of studies focus on designing complex modules to fuse different modalities. With the widespread application of large language models (LLMs), valuable information can be more effectively extracted from natural language. Therefore, we aim to leverage the advantages of large language models to design a structurally simple and highly adaptable multimodal fusion model architecture. We proposed MultimodAl Segmentation with TExt PRompts (MASTER) architecture, which integrates LLM into the fusion of RGB-Thermal multimodal data and allows complex query text to participate in the fusion process. Our model utilizes a dual-path structure to extract information from different modalities of images. Additionally, we employ LLM as the core module for multimodal fusion, enabling the model to generate learnable codebook tokens from RGB, thermal images, and textual information. A lightweight image decoder is used to obtain semantic segmentation results. The proposed MASTER performs exceptionally well in benchmark tests across various automated driving scenarios, yielding promising results.

Via

Access Paper or Ask Questions

WildOcc: A Benchmark for Off-Road 3D Semantic Occupancy Prediction

Oct 21, 2024

Heng Zhai, Jilin Mei, Chen Min, Liang Chen, Fangzhou Zhao, Yu Hu

Abstract:3D semantic occupancy prediction is an essential part of autonomous driving, focusing on capturing the geometric details of scenes. Off-road environments are rich in geometric information, therefore it is suitable for 3D semantic occupancy prediction tasks to reconstruct such scenes. However, most of researches concentrate on on-road environments, and few methods are designed for off-road 3D semantic occupancy prediction due to the lack of relevant datasets and benchmarks. In response to this gap, we introduce WildOcc, to our knowledge, the first benchmark to provide dense occupancy annotations for off-road 3D semantic occupancy prediction tasks. A ground truth generation pipeline is proposed in this paper, which employs a coarse-to-fine reconstruction to achieve a more realistic result. Moreover, we introduce a multi-modal 3D semantic occupancy prediction framework, which fuses spatio-temporal information from multi-frame images and point clouds at voxel level. In addition, a cross-modality distillation function is introduced, which transfers geometric knowledge from point clouds to image features.

Via

Access Paper or Ask Questions

Autonomous Driving in Unstructured Environments: How Far Have We Come?

Oct 10, 2024

Chen Min, Shubin Si, Xu Wang, Hanzhang Xue, Weizhong Jiang, Yang Liu, Juan Wang, Qingtian Zhu, Qi Zhu, Lun Luo(+27 more)

Figure 1 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 2 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 3 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 4 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Abstract:Research on autonomous driving in unstructured outdoor environments is less advanced than in structured urban settings due to challenges like environmental diversities and scene complexity. These environments-such as rural areas and rugged terrains-pose unique obstacles that are not common in structured urban areas. Despite these difficulties, autonomous driving in unstructured outdoor environments is crucial for applications in agriculture, mining, and military operations. Our survey reviews over 250 papers for autonomous driving in unstructured outdoor environments, covering offline mapping, pose estimation, environmental perception, path planning, end-to-end autonomous driving, datasets, and relevant challenges. We also discuss emerging trends and future research directions. This review aims to consolidate knowledge and encourage further research for autonomous driving in unstructured environments. To support ongoing work, we maintain an active repository with up-to-date literature and open-source projects at: https://github.com/chaytonmin/Survey-Autonomous-Driving-in-Unstructured-Environments.

* Survey paper; 38 pages

Via

Access Paper or Ask Questions

Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

Sep 09, 2024

Junkun Chen, Jilin Mei, Liang Chen, Fangzhou Zhao, Yu Hu

Figure 1 for Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

Figure 2 for Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

Figure 3 for Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

Figure 4 for Proto-OOD: Enhancing OOD Object Detection with Prototype Feature Similarity

Abstract:The limited training samples for object detectors commonly result in low accuracy out-of-distribution (OOD) object detection. We have observed that feature vectors of the same class tend to cluster tightly in feature space, whereas those of different classes are more scattered. This insight motivates us to leverage feature similarity for OOD detection. Drawing on the concept of prototypes prevalent in few-shot learning, we introduce a novel network architecture, Proto-OOD, designed for this purpose. Proto-OOD enhances prototype representativeness through contrastive loss and identifies OOD data by assessing the similarity between input features and prototypes. It employs a negative embedding generator to create negative embedding, which are then used to train the similarity module. Proto-OOD achieves significantly lower FPR95 in MS-COCO dataset and higher mAP for Pascal VOC dataset, when utilizing Pascal VOC as ID dataset and MS-COCO as OOD dataset. Additionally, we identify limitations in existing evaluation metrics and propose an enhanced evaluation protocol.

* 14pages

Via

Access Paper or Ask Questions

TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

Aug 28, 2024

Junbao Zhou, Jilin Mei, Pengze Wu, Liang Chen, Fangzhou Zhao, Xijun Zhao, Yu Hu

Figure 1 for TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

Figure 2 for TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

Figure 3 for TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

Figure 4 for TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation

Abstract:In autonomous driving, 3D LiDAR plays a crucial role in understanding the vehicle's surroundings. However, the newly emerged, unannotated objects presents few-shot learning problem for semantic segmentation. This paper addresses the limitations of current few-shot semantic segmentation by exploiting the temporal continuity of LiDAR data. Employing a tracking model to generate pseudo-ground-truths from a sequence of LiDAR frames, our method significantly augments the dataset, enhancing the model's ability to learn on novel classes. However, this approach introduces a data imbalance biased to novel data that presents a new challenge of catastrophic forgetting. To mitigate this, we incorporate LoRA, a technique that reduces the number of trainable parameters, thereby preserving the model's performance on base classes while improving its adaptability to novel classes. This work represents a significant step forward in few-shot 3D LiDAR semantic segmentation for autonomous driving. Our code is available at https://github.com/junbao-zhou/Track-no-forgetting.

Via

Access Paper or Ask Questions

PID: Physics-Informed Diffusion Model for Infrared Image Generation

Jul 12, 2024

Fangyuan Mao, Jilin Mei, Shun Lu, Fuyang Liu, Liang Chen, Fangzhou Zhao, Yu Hu

Figure 1 for PID: Physics-Informed Diffusion Model for Infrared Image Generation

Figure 2 for PID: Physics-Informed Diffusion Model for Infrared Image Generation

Figure 3 for PID: Physics-Informed Diffusion Model for Infrared Image Generation

Figure 4 for PID: Physics-Informed Diffusion Model for Infrared Image Generation

Abstract:Infrared imaging technology has gained significant attention for its reliable sensing ability in low visibility conditions, prompting many studies to convert the abundant RGB images to infrared images. However, most existing image translation methods treat infrared images as a stylistic variation, neglecting the underlying physical laws, which limits their practical application. To address these issues, we propose a Physics-Informed Diffusion (PID) model for translating RGB images to infrared images that adhere to physical laws. Our method leverages the iterative optimization of the diffusion model and incorporates strong physical constraints based on prior knowledge of infrared laws during training. This approach enhances the similarity between translated infrared images and the real infrared domain without increasing extra training parameters. Experimental results demonstrate that PID significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/fangyuanmao/PID.

Via

Access Paper or Ask Questions

PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Feb 28, 2023

Shun Lu, Yu Hu, Longxing Yang, Zihao Sun, Jilin Mei, Jianchao Tan, Chengru Song

Figure 1 for PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Figure 2 for PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Figure 3 for PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Figure 4 for PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Abstract:Based on the weight-sharing mechanism, one-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models, largely reducing the search cost. However, several works have pointed out that the shared weights suffer from different gradient descent directions during training. And we further find that large gradient variance occurs during supernet training, which degrades the supernet ranking consistency. To mitigate this issue, we propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta (PA&DA). We theoretically derive the relationship between the gradient variance and the sampling distributions, and reveal that the optimal sampling probability is proportional to the normalized gradient norm of path and training data. Hence, we use the normalized gradient norm as the importance indicator for path and training data, and adopt an importance sampling strategy for the supernet training. Our method only requires negligible computation cost for optimizing the sampling distributions of path and data, but achieves lower gradient variance during supernet training and better generalization performance for the supernet, resulting in a more consistent NAS. We conduct comprehensive comparisons with other improved approaches in various search spaces. Results show that our method surpasses others with more reliable ranking performance and higher accuracy of searched architectures, showing the effectiveness of our method. Code is available at https://github.com/ShunLu91/PA-DA.

* To appear in CVPR 2023; we will update the camera-ready version soon

Via

Access Paper or Ask Questions

Few-shot 3D LiDAR Semantic Segmentation for Autonomous Driving

Feb 17, 2023

Jilin Mei, Junbao Zhou, Yu Hu

Abstract:In autonomous driving, the novel objects and lack of annotations challenge the traditional 3D LiDAR semantic segmentation based on deep learning. Few-shot learning is a feasible way to solve these issues. However, currently few-shot semantic segmentation methods focus on camera data, and most of them only predict the novel classes without considering the base classes. This setting cannot be directly applied to autonomous driving due to safety concerns. Thus, we propose a few-shot 3D LiDAR semantic segmentation method that predicts both novel classes and base classes simultaneously. Our method tries to solve the background ambiguity problem in generalized few-shot semantic segmentation. We first review the original cross-entropy and knowledge distillation losses, then propose a new loss function that incorporates the background information to achieve 3D LiDAR few-shot semantic segmentation. Extensive experiments on SemanticKITTI demonstrate the effectiveness of our method.

Via

Access Paper or Ask Questions

Scene Context Based Semantic Segmentation for 3D LiDAR Data in Dynamic Scene

Mar 31, 2020

Jilin Mei, Huijing Zhao

Figure 1 for Scene Context Based Semantic Segmentation for 3D LiDAR Data in Dynamic Scene

Figure 2 for Scene Context Based Semantic Segmentation for 3D LiDAR Data in Dynamic Scene

Figure 3 for Scene Context Based Semantic Segmentation for 3D LiDAR Data in Dynamic Scene

Figure 4 for Scene Context Based Semantic Segmentation for 3D LiDAR Data in Dynamic Scene

Abstract:We propose a graph neural network(GNN) based method to incorporate scene context for the semantic segmentation of 3D LiDAR data. The problem is defined as building a graph to represent the topology of a center segment with its neighborhoods, then inferring the segment label. The node of graph is generated from the segment on range image, which is suitable for both sparse and dense point cloud. Edge weights that evaluate the correlations of center node and its neighborhoods are automatically encoded by a neural network, therefore the number of neighbor nodes is no longer a sensitive parameter. A system consists of segment generation, graph building, edge weight estimation, node updating, and node prediction is designed. Quantitative evaluation on a dataset of dynamic scene shows that our method has better performance than unary CNN with 8% improvement, as well as normal GNN with 17% improvement.

* 8 pages

Via

Access Paper or Ask Questions