Abstract: Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Owing to their limited diversity and difficulty in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge for diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we introduce an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen), a system specifically designed to leverage a 4D world volume as the foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene-editing tasks.
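A minimal sketch of the two-phase pipeline this abstract describes, not the authors' code: module names, tensor shapes, and the simple conv/linear layers are illustrative assumptions, and the diffusion-based video generator is replaced by a single projection for brevity.

```python
# Sketch of a WoVoGen-style two-phase pipeline (assumed shapes and layers).
import torch
import torch.nn as nn

class WorldVolumeForecaster(nn.Module):
    """Phase (i): envision a future 4D (time x 3D) world volume from the current
    volume and a vehicle control sequence (e.g., one control vector per future step)."""
    def __init__(self, vol_ch=32, ctrl_dim=3):
        super().__init__()
        self.ctrl_embed = nn.Linear(ctrl_dim, vol_ch)
        self.update = nn.Conv3d(2 * vol_ch, vol_ch, kernel_size=3, padding=1)

    def forward(self, volume, controls):
        # volume: (B, C, X, Y, Z); controls: (B, T, ctrl_dim)
        future = []
        for t in range(controls.shape[1]):
            c = self.ctrl_embed(controls[:, t])              # (B, C)
            c = c[:, :, None, None, None].expand_as(volume)  # broadcast over space
            volume = torch.relu(self.update(torch.cat([volume, c], dim=1)))
            future.append(volume)
        return torch.stack(future, dim=1)                    # (B, T, C, X, Y, Z)

class MultiCamVideoGenerator(nn.Module):
    """Phase (ii): render per-camera frames conditioned on the envisioned volume.
    The real model is diffusion-based; a linear projection stands in here."""
    def __init__(self, vol_ch=32, n_cams=6, img_hw=(32, 64)):
        super().__init__()
        self.n_cams, self.img_hw = n_cams, img_hw
        self.to_pixels = nn.Linear(vol_ch, n_cams * 3 * img_hw[0] * img_hw[1])

    def forward(self, future_volumes):
        b, t = future_volumes.shape[:2]
        pooled = future_volumes.mean(dim=(-3, -2, -1))       # (B, T, C)
        frames = self.to_pixels(pooled)
        return frames.view(b, t, self.n_cams, 3, *self.img_hw)

vol = torch.randn(1, 32, 8, 8, 4)
ctrl = torch.randn(1, 5, 3)                                  # 5 future control steps
videos = MultiCamVideoGenerator()(WorldVolumeForecaster()(vol, ctrl))
print(videos.shape)  # torch.Size([1, 5, 6, 3, 32, 64])
```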
Abstract: Traffic prediction is a typical spatio-temporal data mining task and is of great significance to public transportation systems. Considering the demands of large-scale deployment, we identify the key properties of an ideal spatio-temporal prediction method: efficient, lightweight, and effective. However, current deep-model-based spatio-temporal prediction solutions generally have intricate architectures and cumbersome optimization, which can hardly meet these expectations. To accomplish the above goals, we propose an intuitive and novel framework, MLPST, a pure multi-layer perceptron architecture for traffic prediction. Specifically, we first capture spatial relationships from both local and global receptive fields. Then, temporal dependencies across different intervals are comprehensively considered. Through compact and fast MLP processing, MLPST captures the spatial and temporal dependencies well while requiring only linear computational complexity and more than an order of magnitude fewer model parameters than the baselines. Extensive experiments validate the superior effectiveness and efficiency of MLPST against advanced baselines, and among the models with optimal accuracy, MLPST achieves the best time and space efficiency.
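A hedged MLP-Mixer-style sketch of the idea the MLPST abstract describes: one MLP mixes information across spatial locations and another across time steps, with no convolutions or attention. Names and dimensions are assumptions, and a single global spatial-mixing block stands in for the paper's separate local and global spatial modules.

```python
# Pure-MLP spatio-temporal mixing sketch (assumed block layout, not the paper's code).
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Mix along `dim` (space or time) with a 2-layer MLP; cost is linear in that size."""
    def __init__(self, size, hidden):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(size, hidden), nn.GELU(), nn.Linear(hidden, size))

    def forward(self, x, dim):
        x_t = x.transpose(dim, -1)
        return (x_t + self.mlp(x_t)).transpose(dim, -1)  # residual mixing along `dim`

class MLPSTSketch(nn.Module):
    def __init__(self, n_nodes, n_steps, channels=16):
        super().__init__()
        self.embed = nn.Linear(1, channels)
        self.spatial = MixerBlock(n_nodes, 4 * n_nodes)
        self.temporal = MixerBlock(n_steps, 4 * n_steps)
        self.head = nn.Linear(channels, 1)

    def forward(self, x):
        # x: (B, T, N, 1) historical traffic readings per time step and grid cell/node
        h = self.embed(x)            # (B, T, N, C)
        h = self.spatial(h, dim=2)   # mix across spatial locations
        h = self.temporal(h, dim=1)  # mix across time intervals
        return self.head(h[:, -1])   # predict the next step: (B, N, 1)

model = MLPSTSketch(n_nodes=64, n_steps=12)
print(model(torch.randn(2, 12, 64, 1)).shape)  # torch.Size([2, 64, 1])
```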
Abstract: Visual Place Recognition is an essential component of camera localization and loop closure detection systems, and it has attracted widespread interest in domains such as computer vision, robotics, and AR/VR. In this work, we propose a faster, lighter, and stronger approach that produces models with fewer parameters and spends less time in the inference stage. We design RepVGG-lite as the backbone network of our architecture; it is more discriminative than other general-purpose networks for the place recognition task while offering clear speed advantages. We extract patch-level descriptors at only a single scale from the global descriptors in the feature extraction stage. We then design a trainable, attention-based feature matcher that exploits both the spatial relationships of the features and their visual appearance. Comprehensive experiments on challenging benchmark datasets demonstrate that the proposed method outperforms other recent state-of-the-art learned approaches while achieving higher inference speed. Our system has 14 times fewer parameters than Patch-NetVLAD, 6.8 times lower theoretical FLOPs, and runs 21 and 33 times faster in feature extraction and feature matching, respectively. Moreover, our approach is 0.5\% better than Patch-NetVLAD in Recall@1. We use subsets of the Mapillary Street-Level Sequences dataset to conduct experiments on all other challenging conditions.
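A hedged sketch of the kind of trainable, attention-based patch-descriptor matcher the abstract describes, not the authors' implementation: descriptors from the two images attend to themselves and to each other, then a similarity matrix scores candidate matches. Positional encodings of the patches, which would carry the spatial relationships, are omitted here for brevity.

```python
# Attention-based patch matcher sketch (assumed architecture and dimensions).
import torch
import torch.nn as nn

class AttentionMatcher(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def refine(self, desc, other):
        desc = desc + self.self_attn(desc, desc, desc)[0]     # context within an image
        desc = desc + self.cross_attn(desc, other, other)[0]  # context from the other image
        return self.proj(desc)

    def forward(self, query_desc, ref_desc):
        # query_desc: (B, Nq, D) patch descriptors of the query image
        # ref_desc:   (B, Nr, D) patch descriptors of the reference image
        q = self.refine(query_desc, ref_desc)
        r = self.refine(ref_desc, query_desc)
        scores = torch.einsum('bnd,bmd->bnm', q, r) / q.shape[-1] ** 0.5
        return scores.softmax(dim=-1)  # soft assignment of query patches to reference patches

matcher = AttentionMatcher()
sim = matcher(torch.randn(1, 100, 128), torch.randn(1, 120, 128))
print(sim.shape)  # torch.Size([1, 100, 120])
```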
Abstract: We propose a method, called EventPoint, for extracting interest points and descriptors from frame-based event data using self-supervised learning. Unlike other feature extraction methods for event data, we train our model on DSEC, a real event-camera driving dataset, with the proposed self-supervised learning method, and the training process fully takes the characteristics of event data into account. To verify the effectiveness of our work, we conduct several complete evaluations: we emulate DART and carry out feature matching experiments on the N-Caltech101 dataset, where the results show that EventPoint performs better than DART; we also use the Vid2e tool provided by UZH to convert Oxford RobotCar data into event-based format and, combined with the provided INS information, carry out a global pose estimation experiment, which is important for SLAM. To the best of our knowledge, this is the first work to carry out this challenging task. Extensive experimental results show that EventPoint achieves better results while running in real time on a CPU.
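A rough sketch of the pipeline the abstract implies: accumulate asynchronous events into a frame-like representation, then run a small network with an interest-point head and a descriptor head. The accumulation scheme and architecture are assumptions for illustration, not EventPoint itself.

```python
# Event-frame interest point / descriptor sketch (assumed representation and layers).
import torch
import torch.nn as nn

def events_to_frame(events, height, width):
    """events: (N, 4) tensor of (x, y, t, polarity in {-1, +1}); returns a (1, H, W) frame."""
    frame = torch.zeros(height, width)
    x, y, p = events[:, 0].long(), events[:, 1].long(), events[:, 3]
    frame.index_put_((y, x), p, accumulate=True)  # signed event counts per pixel
    return frame.unsqueeze(0)

class EventPointSketch(nn.Module):
    def __init__(self, desc_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.detector = nn.Conv2d(64, 1, 1)           # per-pixel interest-point score
        self.descriptor = nn.Conv2d(64, desc_dim, 1)  # dense descriptors

    def forward(self, frame):
        feat = self.backbone(frame)
        return torch.sigmoid(self.detector(feat)), nn.functional.normalize(self.descriptor(feat), dim=1)

events = torch.stack([torch.randint(0, 64, (500,)).float(),   # x
                      torch.randint(0, 48, (500,)).float(),   # y
                      torch.rand(500),                        # t
                      torch.randint(0, 2, (500,)).float() * 2 - 1], dim=1)  # polarity
frame = events_to_frame(events, height=48, width=64).unsqueeze(0)  # (1, 1, 48, 64)
scores, descs = EventPointSketch()(frame)
print(scores.shape, descs.shape)  # (1, 1, 48, 64) (1, 64, 48, 64)
```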
Abstract: Panoramic segmentation is one of the more difficult scene-level image segmentation tasks. With the development of CNNs, panoramic segmentation has advanced considerably. However, current panoramic segmentation algorithms focus mainly on contextual semantics, while image details are not handled well enough; moreover, they struggle with the accuracy of occluded-object segmentation, small-object segmentation, boundary-pixel classification, and so on. To address these issues, this paper presents several useful tricks. (a) By changing the basic segmentation model, the model can take into account both large objects and the boundary pixels of image details. (b) We modify the loss function so that it accounts for the boundary pixels of multiple objects in the image. (c) We use a semi-supervised approach to regain control of the training process. (d) We use multi-scale training and inference. We name the combination of these operations AinnoSeg; AinnoSeg achieves state-of-the-art performance on the well-known ADE20K dataset.
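A minimal sketch of the multi-scale inference part of trick (d): run the segmentation model at several scales (optionally with horizontal flipping) and average the resized logits. The scale list, flip augmentation, and model interface are assumptions, not AinnoSeg's actual settings.

```python
# Multi-scale test-time inference sketch for a segmentation model (assumed settings).
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_inference(model, image, scales=(0.75, 1.0, 1.25), flip=True):
    """image: (B, 3, H, W); returns averaged per-class logits at full resolution."""
    b, _, h, w = image.shape
    logits_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        out = model(scaled)
        if flip:  # horizontal-flip augmentation at test time
            out = out + torch.flip(model(torch.flip(scaled, dims=[-1])), dims=[-1])
        logits_sum = logits_sum + F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
    return logits_sum / (len(scales) * (2 if flip else 1))

# Usage with a dummy "model": any callable mapping (B, 3, h, w) -> (B, num_classes, h, w).
dummy = lambda x: torch.randn(x.shape[0], 150, *x.shape[-2:])  # 150 classes as in ADE20K
print(multi_scale_inference(dummy, torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 150, 64, 64])
```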