Shuai Shi

Occupancy World Model for Robots

May 07, 2025

Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Jingkai Sun, Jiahang Cao, Jiaxu Wang(+5 more)

Abstract: Understanding and forecasting scene evolution strongly affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, recent works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on structured outdoor road scenes and neglect forecasting 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolution of observed fine-grained occupancy and propose RoboOccWorld, an occupancy world model that forecasts scene evolution using a combined spatio-temporal receptive field and a guided autoregressive transformer. We propose Conditional Causal State Attention (CCSA), which uses the camera poses of the next state as conditions to guide the autoregressive transformer in adapting to and understanding indoor robotics scenarios. To effectively exploit the spatio-temporal cues in historical observations, we propose Hybrid Spatio-Temporal Aggregation (HSTA), which obtains the combined spatio-temporal receptive field from multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.

RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots

Apr 20, 2025

Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Hengle Ren, Renjing Xu, Jian Tang

Abstract: 3D occupancy prediction enables robots to obtain fine-grained spatial geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits both the network's estimation of complex environments and the description of the scene by 3D Gaussians. In this paper, we propose RoboOcc, a 3D occupancy prediction method that enhances geometric and semantic scene understanding for robots. It uses the Opacity-guided Self-Encoder (OSE) to alleviate the semantic ambiguity of overlapping Gaussians and the Geometry-aware Cross-Encoder (GCE) to accomplish fine-grained geometric modeling of the surrounding scene. We conduct extensive experiments on the Occ-ScanNet and EmbodiedOcc-ScanNet datasets, and RoboOcc achieves state-of-the-art performance in both local and global camera settings. Further, in ablation studies of Gaussian parameters, RoboOcc outperforms state-of-the-art methods by a large margin of (8.47, 6.27) in the IoU and mIoU metrics, respectively. The code will be released soon.
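
The IoU and mIoU figures quoted in the abstract can be illustrated with a minimal sketch. It assumes flattened voxel label lists and a label id of 0 for empty space; the names `EMPTY`, `occupancy_iou`, and `mean_iou` are hypothetical, and the benchmark's actual evaluation code may handle absent classes or masking differently.

```python
from typing import Sequence

EMPTY = 0  # assumed label id for free space (hypothetical convention)


def occupancy_iou(pred: Sequence[int], target: Sequence[int], empty: int = EMPTY) -> float:
    """Geometric IoU: occupied vs. empty voxels, ignoring the semantic class."""
    inter = sum(1 for p, t in zip(pred, target) if p != empty and t != empty)
    union = sum(1 for p, t in zip(pred, target) if p != empty or t != empty)
    return inter / union if union else 0.0


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
             empty: int = EMPTY) -> float:
    """Semantic mIoU: mean of per-class IoU over non-empty classes.

    Classes that appear in neither prediction nor target are skipped here;
    this is an assumption, not necessarily the benchmark's rule.
    """
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, classes {0: empty, 1, 2}.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 1]
iou = occupancy_iou(pred, target)
miou = mean_iou(pred, target, num_classes=3)
```

IoU here measures how well occupied space is recovered regardless of class, while mIoU also penalizes assigning the wrong semantic label to an occupied voxel.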

VMambaCC: A Visual State Space Model for Crowd Counting

May 07, 2024

Hao-Yuan Ma, Li Zhang, Shuai Shi

Abstract: As a deep learning model, Visual Mamba (VMamba) has low computational complexity and a global receptive field, and has been successfully applied to image classification and detection. To extend its applications, we apply VMamba to crowd counting and propose a novel VMambaCC (VMamba Crowd Counting) model. Naturally, VMambaCC inherits the merits of VMamba, namely global modeling of images and low computational cost. Additionally, we design a Multi-head High-level Feature (MHF) attention mechanism for VMambaCC. MHF is a new attention mechanism that leverages high-level semantic features to augment low-level semantic features, thereby enhancing spatial feature representation with greater precision. Building upon MHF, we further present a High-level Semantic Supervised Feature Pyramid Network (HS2PFN) that progressively integrates and enhances high-level semantic information with low-level semantic information. Extensive experimental results on five public datasets validate the efficacy of our approach. For example, our method achieves a mean absolute error of 51.87 and a mean squared error of 81.3 on the ShangHaiTech_PartA dataset. Our code is coming soon.
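
The two error metrics named in the abstract can be sketched on per-image crowd counts as follows. Crowd-counting papers often report the square root of the mean squared count error under the name "MSE"; whether this abstract follows that convention is an assumption here, so both forms are shown. The function names and the toy counts are hypothetical.

```python
import math
from typing import Sequence


def count_mae(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts per image."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n


def count_mse(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Plain mean squared error between predicted and ground-truth counts."""
    n = len(true_counts)
    return sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n


def count_rmse(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Root of the mean squared error (assumed to be what many papers call 'MSE')."""
    return math.sqrt(count_mse(pred_counts, true_counts))


# Toy example with three images (made-up counts, not results from the paper).
pred = [105.0, 240.0, 52.0]
true = [100.0, 250.0, 50.0]
mae = count_mae(pred, true)
rmse = count_rmse(pred, true)
```

MAE reflects the typical counting error per image, while the squared variant weights large misses more heavily and so is more sensitive to outlier images.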

Towards V2I Age-aware Fairness Access: A DQN Based Intelligent Vehicular Node Training and Test Method

Aug 02, 2022

Qiong Wu, Shuai Shi, Ziyang Wan, Qiang Fan, Pingyi Fan, Cui Zhang

Abstract: Vehicles on the road frequently exchange data with the base station (BS) through vehicle-to-infrastructure (V2I) communications to ensure the normal use of vehicular applications, where the IEEE 802.11 distributed coordination function (DCF) is employed to allocate a minimum contention window (MCW) for channel access. Each vehicle may change its MCW to gain more access opportunities at the expense of others, which results in unfair communication performance. Moreover, the key access parameter, the MCW, is private information that each vehicle is unwilling to share with other vehicles. In this uncertain setting, where age of information (AoI) is an important communication metric for measuring the freshness of data, we design an intelligent vehicular node that learns the dynamic environment and predicts the optimal MCW that allows it to achieve age fairness. To allocate the optimal MCW for the vehicular node, we employ a learning algorithm that makes a desirable decision by learning from replay history data. In particular, the algorithm is proposed by extending the traditional DQN training and testing method. Finally, a comparison with other methods shows that the proposed DQN method can significantly improve the age fairness of the intelligent node.

* This paper has been accepted by Chinese Journal of Electronics. Simulation codes have been provided at: https://github.com/qiongwu86/Age-Fairness
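
As a rough illustration of the age-of-information metric mentioned in the abstract, the sketch below tracks AoI in a slotted model and scores fairness across vehicles. The slot model (age resets to 1 on a successful delivery, otherwise grows by 1) and the use of Jain's fairness index are assumptions made for illustration; the abstract does not state the paper's exact AoI model or fairness measure, and the function names are hypothetical.

```python
from typing import Sequence


def average_aoi(delivered: Sequence[bool], initial_age: int = 1) -> float:
    """Time-averaged age of information over a sequence of slots.

    Assumed model: in each slot the age resets to 1 if an update is
    delivered, otherwise it grows by 1.
    """
    age = initial_age
    total = 0
    for ok in delivered:
        age = 1 if ok else age + 1
        total += age
    return total / len(delivered)


def jain_index(values: Sequence[float]) -> float:
    """Jain's fairness index: 1.0 when all values are equal, 1/n in the worst case."""
    n = len(values)
    return sum(values) ** 2 / (n * sum(v * v for v in values))


# Two vehicles with different channel-access success patterns (toy data).
vehicle_a = [True, False, False, True, False]   # ages: 1, 2, 3, 1, 2
vehicle_b = [True, True, True, True, True]      # ages: 1, 1, 1, 1, 1
ages = [average_aoi(vehicle_a), average_aoi(vehicle_b)]
fairness = jain_index(ages)
```

A vehicle that wins channel access more often keeps a lower average age, so unequal contention windows show up as a fairness index below 1.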

Dog nose print matching with dual global descriptor based on Contrastive Learning

Jun 01, 2022

Bin Li, Zhongan Wang, Nan Wu, Shuai Shi, Qijun Ma

Abstract: Recent studies in biometric-based identification tasks have shown that deep learning methods can achieve better performance. These methods generally extract global features as a descriptor to represent the original image. Nonetheless, this approach does not perform well for biometric identification in fine-grained tasks. The main reason is that a single image descriptor contains insufficient information to represent the image. In this paper, we present a dual global descriptor model, which combines multiple global descriptors to exploit multi-level image features. Moreover, we utilize a contrastive loss to enlarge the distance between image representations of confusing classes. The proposed framework achieves top-2 on the CVPR2022 Biometrics Workshop Pet Biometric Challenge. The source code and trained models are publicly available at: https://github.com/flyingsheepbin/pet-biometrics
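
The idea of a contrastive loss that pushes apart representations of different classes can be sketched with the classic pairwise margin form. This is an assumption chosen for illustration: the abstract does not specify which contrastive formulation the authors use, and the function names and margin values below are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a: Sequence[float], b: Sequence[float],
                     same_class: bool, margin: float = 1.0) -> float:
    """Pairwise margin-based contrastive loss (assumed form).

    Same-class pairs are pulled together (loss = d^2); different-class
    pairs are pushed apart until they are at least `margin` away
    (loss = max(0, margin - d)^2).
    """
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings at distance 5 from each other.
emb_a = [0.0, 0.0]
emb_b = [3.0, 4.0]
loss_same = contrastive_loss(emb_a, emb_b, same_class=True)               # penalizes distance
loss_diff_close = contrastive_loss(emb_a, emb_b, same_class=False, margin=6.0)  # inside margin
loss_diff_far = contrastive_loss(emb_a, emb_b, same_class=False, margin=1.0)    # already separated
```

Only different-class pairs that sit inside the margin contribute to the loss, which is what concentrates training on confusing classes.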
