Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Li Guo

CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Oct 14, 2025

Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu

Abstract:Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.

Via

Access Paper or Ask Questions

3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Jul 16, 2025

Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu

Figure 1 for 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Figure 2 for 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Figure 3 for 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Figure 4 for 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Abstract:With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15\%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84\%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed on the https://3D-MoRe.github.io.

* Accepted by IROS 2025

Via

Access Paper or Ask Questions

Hyperbolic-PDE GNN: Spectral Graph Neural Networks in the Perspective of A System of Hyperbolic Partial Differential Equations

May 29, 2025

Juwei Yue, Haikuo Li, Jiawei Sheng, Xiaodong Li, Taoyu Su, Tingwen Liu, Li Guo

Abstract:Graph neural networks (GNNs) leverage message passing mechanisms to learn the topological features of graph data. Traditional GNNs learns node features in a spatial domain unrelated to the topology, which can hardly ensure topological features. In this paper, we formulates message passing as a system of hyperbolic partial differential equations (hyperbolic PDEs), constituting a dynamical system that explicitly maps node representations into a particular solution space. This solution space is spanned by a set of eigenvectors describing the topological structure of graphs. Within this system, for any moment in time, a node features can be decomposed into a superposition of the basis of eigenvectors. This not only enhances the interpretability of message passing but also enables the explicit extraction of fundamental characteristics about the topological structure. Furthermore, by solving this system of hyperbolic partial differential equations, we establish a connection with spectral graph neural networks (spectral GNNs), serving as a message passing enhancement paradigm for spectral GNNs.We further introduce polynomials to approximate arbitrary filter functions. Extensive experiments demonstrate that the paradigm of hyperbolic PDEs not only exhibits strong flexibility but also significantly enhances the performance of various spectral GNNs across diverse graph tasks.

* International Conference on Machine Learning 2025
* 18 pages, 2 figures, published to ICML 2025

Via

Access Paper or Ask Questions

Graph Wave Networks

May 26, 2025

Juwei Yue, Haikuo Li, Jiawei Sheng, Yihan Guo, Xinghua Zhang, Chuan Zhou, Tingwen Liu, Li Guo

Abstract:Dynamics modeling has been introduced as a novel paradigm in message passing (MP) of graph neural networks (GNNs). Existing methods consider MP between nodes as a heat diffusion process, and leverage heat equation to model the temporal evolution of nodes in the embedding space. However, heat equation can hardly depict the wave nature of graph signals in graph signal processing. Besides, heat equation is essentially a partial differential equation (PDE) involving a first partial derivative of time, whose numerical solution usually has low stability, and leads to inefficient model training. In this paper, we would like to depict more wave details in MP, since graph signals are essentially wave signals that can be seen as a superposition of a series of waves in the form of eigenvector. This motivates us to consider MP as a wave propagation process to capture the temporal evolution of wave signals in the space. Based on wave equation in physics, we innovatively develop a graph wave equation to leverage the wave propagation on graphs. In details, we demonstrate that the graph wave equation can be connected to traditional spectral GNNs, facilitating the design of graph wave networks based on various Laplacians and enhancing the performance of the spectral GNNs. Besides, the graph wave equation is particularly a PDE involving a second partial derivative of time, which has stronger stability on graphs than the heat equation that involves a first partial derivative of time. Additionally, we theoretically prove that the numerical solution derived from the graph wave equation are constantly stable, enabling to significantly enhance model efficiency while ensuring its performance. Extensive experiments show that GWNs achieve SOTA and efficient performance on benchmark datasets, and exhibit outstanding performance in addressing challenging graph problems, such as over-smoothing and heterophily.

* The ACM Web Conference 2025
* 15 pages, 8 figures, published to WWW 2025

Via

Access Paper or Ask Questions

Movable-Element STARS-Assisted Near-Field Wideband Communications

May 25, 2025

Guangyu Zhu, Xidong Mu, Li Guo, Ao Huang, Shibiao Xu

Abstract:A novel movable-element simultaneously transmitting and reflecting surface (ME-STARS)-assisted near-field wideband communication framework is proposed. In particular, the position of each STARS element can be adjusted to combat the significant wideband beam squint issue in the near field instead of using costly true-time delay components. Four practical ME-STARS element movement modes are proposed, namely region-based (RB), horizontal-based (HB), vertical-based (VB), and diagonal-based (DB) modes. Based on this, a near-field wideband multi-user downlink communication scenario is considered, where a sum rate maximization problem is formulated by jointly optimizing the base station (BS) precoding, ME-STARS beamforming, and element positions. To solve this intractable problem, a two-layer algorithm is developed. For the inner layer, the block coordinate descent optimization framework is utilized to solve the BS precoding and ME-STARS beamforming in an iterative manner. For the outer layer, the particle swarm optimization-based heuristic search method is employed to determine the desired element positions. Numerical results show that:1) the ME-STARSs can effectively address the beam squint for near-field wideband communications compared to conventional STARSs with fixed element positions; 2) the RB mode achieves the most efficient beam squint effect mitigation, while the DB mode achieves the best trade-off between performance gain and hardware overhead; and 3) an increase in the number of ME-STARS elements or BS subcarriers substantially improves the system performance.

Via

Access Paper or Ask Questions

FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

May 23, 2025

Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu

Figure 1 for FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Figure 2 for FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Figure 3 for FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Figure 4 for FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

Abstract:Prompt learning as a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive teacher model online inference sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose {{\large {\textbf{F}}}}aster {{\large {\textbf{D}}}}istillation-{{\large {\textbf{B}}}}ased {{\large {\textbf{P}}}}rompt {{\large {\textbf{L}}}}earning (\textbf{FDBPL}), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that containing multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2\times$ faster training speed.

Via

Access Paper or Ask Questions

Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model

May 14, 2025

George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross

Abstract:The Unconstrained Feature Model (UFM) is a mathematical framework that enables closed-form approximations for minimal training loss and related performance measures in deep neural networks (DNNs). This paper leverages the UFM to provide qualitative insights into neural multivariate regression, a critical task in imitation learning, robotics, and reinforcement learning. Specifically, we address two key questions: (1) How do multi-task models compare to multiple single-task models in terms of training performance? (2) Can whitening and normalizing regression targets improve training performance? The UFM theory predicts that multi-task models achieve strictly smaller training MSE than multiple single-task models when the same or stronger regularization is applied to the latter, and our empirical results confirm these findings. Regarding whitening and normalizing regression targets, the UFM theory predicts that they reduce training MSE when the average variance across the target dimensions is less than one, and our empirical results once again confirm these findings. These findings highlight the UFM as a powerful framework for deriving actionable insights into DNN design and data pre-processing strategies.

* 31 pages, 8 figures

Via

Access Paper or Ask Questions

Image Recognition with Online Lightweight Vision Transformer: A Survey

May 06, 2025

Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu(+1 more)

Abstract:The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT

Via

Access Paper or Ask Questions

CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Apr 30, 2025

Zherui Zhang, Changwei Wang, Rongtao Xu, Wenhao Xu, Shibiao Xu, Yu Zhang, Li Guo

Figure 1 for CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Figure 2 for CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Figure 3 for CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Figure 4 for CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Abstract:Data-Free Knowledge Distillation (DFKD) enables the knowledge transfer from the given pre-trained teacher network to the target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses at the embedding level the limitations of previous rely on image-level methods to improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: \textit{\textbf{i.)}} Significant efficiency advantages resulting from altering the generator training paradigm; \textit{\textbf{ii.)}} Competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; \textit{\textbf{iii.)}} Remarkable transferability of data-free learned representations demonstrated in downstream tasks.

Via

Access Paper or Ask Questions

Pinching-Antenna Systems (PASS)-enabled Secure Wireless Communications

Apr 18, 2025

Guangyu Zhu, Xidong Mu, Li Guo, Shibiao Xu, Yuanwei Liu, Naofal Al-Dhahir

Abstract:A novel pinching-antenna systems (PASS)-enabled secure wireless communication framework is proposed. By dynamically adjusting the positions of dielectric particles, namely pinching antennas (PAs), along the waveguides, PASS introduces a novel concept of pinching beamforming to enhance the performance of physical layer security. A fundamental PASS-enabled secure communication system is considered with one legitimate user and one eavesdropper. Both single-waveguide and multiple-waveguide scenarios are studied. 1) For the single-waveguide scenario, the secrecy rate (SR) maximization is formulated to optimize the pinching beamforming. A PA-wise successive tuning (PAST) algorithm is proposed, which ensures constructive signal superposition at the legitimate user while inducing a destructive legitimate signal at the eavesdropper. 2) For the multiple-waveguide scenario, artificial noise (AN) is employed to further improve secrecy performance. A pair of practical transmission architectures are developed: waveguide division (WD) and waveguide multiplexing (WM). The key difference lies in whether each waveguide carries a single type of signal or a mixture of signals with baseband beamforming. For the SR maximization problem under the WD case, a two-stage algorithm is developed, where the pinching beamforming is designed with the PAST algorithm and the baseband power allocation among AN and legitimate signals is solved using successive convex approximation (SCA). For the WM case, an alternating optimization algorithm is developed, where the baseband beamforming is optimized with SCA and the pinching beamforming is designed employing particle swarm optimization.

Via

Access Paper or Ask Questions