Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haoxuan Qu

TSTMotion: Training-free Scene-aware Text-to-motion Generation

May 05, 2025

Ziyan Guo, Haoxuan Qu, Hossein Rahmani, Dewen Soh, Ping Hu, Qiuhong Ke, Jun Liu

Abstract:Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a \textbf{T}raining-free \textbf{S}cene-aware \textbf{T}ext-to-\textbf{Motion} framework, dubbed as \textbf{TSTMotion}, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code in \href{https://tstmotion.github.io/}{Project Page}.

* Accepted by ICME2025

Via

Access Paper or Ask Questions

CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework

Mar 05, 2025

Yanlong Xu, Haoxuan Qu, Jun Liu, Wenxiao Zhang, Xun Yang

Abstract:The goal of point cloud localization based on linguistic description is to identify a 3D position using textual description in large urban environments, which has potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location should be fully described in the text description. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the part of the most significant and nearby surroundings instead of the entire environment. In response to this $\textbf{partially relevant}$ challenge, we propose $\textbf{CMMLoc}$, an uncertainty-aware $\textbf{C}$auchy-$\textbf{M}$ixture-$\textbf{M}$odel ($\textbf{CMM}$) based framework for text-to-point-cloud $\textbf{Loc}$alization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose dataset. Codes are available in this GitHub repository https://github.com/kevin301342/CMMLoc.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Feb 09, 2025

Yue Li, Haoxuan Qu, Mengyuan Liu, Jun Liu, Yujun Cai

Figure 1 for HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Figure 2 for HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Figure 3 for HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Figure 4 for HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Abstract:Transformers have demonstrated remarkable performance in skeleton-based human action recognition, yet their quadratic computational complexity remains a bottleneck for real-world applications. To mitigate this, linear attention mechanisms have been explored but struggle to capture the hierarchical structure of skeleton data. Meanwhile, the Poincar\'e model, as a typical hyperbolic geometry, offers a powerful framework for modeling hierarchical structures but lacks well-defined operations for existing mainstream linear attention. In this paper, we propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition. Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling. Theoretical analysis and extensive experiments on NTU RGB+D and NTU RGB+D 120 datasets demonstrate that HyLiFormer significantly reduces computational complexity while preserving model accuracy, making it a promising solution for efficiency-critical applications.

Via

Access Paper or Ask Questions

DisC-GS: Discontinuity-aware Gaussian Splatting

May 24, 2024

Haoxuan Qu, Zhuoling Li, Hossein Rahmani, Yujun Cai, Jun Liu

Figure 1 for DisC-GS: Discontinuity-aware Gaussian Splatting

Figure 2 for DisC-GS: Discontinuity-aware Gaussian Splatting

Figure 3 for DisC-GS: Discontinuity-aware Gaussian Splatting

Abstract:Recently, Gaussian Splatting, a method that represents a 3D scene as a collection of Gaussian distributions, has gained significant attention in addressing the task of novel view synthesis. In this paper, we highlight a fundamental limitation of Gaussian Splatting: its inability to accurately render discontinuities and boundaries in images due to the continuous nature of Gaussian distributions. To address this issue, we propose a novel framework enabling Gaussian Splatting to perform discontinuity-aware image rendering. Additionally, we introduce a B\'ezier-boundary gradient approximation strategy within our framework to keep the ``differentiability'' of the proposed discontinuity-aware rendering process. Extensive experiments demonstrate the efficacy of our framework.

Via

Access Paper or Ask Questions

Off-the-shelf ChatGPT is a Good Few-shot Human Motion Predictor

May 24, 2024

Haoxuan Qu, Zhaoyang He, Zeyu Hu, Yujun Cai, Jun Liu

Abstract:To facilitate the application of motion prediction in practice, recently, the few-shot motion prediction task has attracted increasing research attention. Yet, in existing few-shot motion prediction works, a specific model that is dedicatedly trained over human motions is generally required. In this work, rather than tackling this task through training a specific human motion prediction model, we instead propose a novel FMP-OC framework. In FMP-OC, in a totally training-free manner, we enable Few-shot Motion Prediction, which is a non-language task, to be performed directly via utilizing the Off-the-shelf language model ChatGPT. Specifically, to lead ChatGPT as a language model to become an accurate motion predictor, in FMP-OC, we first introduce several novel designs to facilitate extracting implicit knowledge from ChatGPT. Moreover, we also incorporate our framework with a motion-in-context learning mechanism. Extensive experiments demonstrate the efficacy of our proposed framework.

Via

Access Paper or Ask Questions

LLMs are Good Action Recognizers

Mar 31, 2024

Haoxuan Qu, Yujun Cai, Jun Liu

Figure 1 for LLMs are Good Action Recognizers

Figure 2 for LLMs are Good Action Recognizers

Figure 3 for LLMs are Good Action Recognizers

Figure 4 for LLMs are Good Action Recognizers

Abstract:Skeleton-based action recognition has attracted lots of research attention. Recently, to build an accurate skeleton-based action recognizer, a variety of works have been proposed. Among them, some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability, while some other works pre-train their recognizers on external data to enrich the knowledge. In this work, we observe that large language models which have been extensively used in various natural language processing tasks generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its ``sentence format'' (i.e., an ``action sentence''). Moreover, we also incorporate our framework with several designs to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.

* CVPR 2024

Via

Access Paper or Ask Questions

GPT-Connect: Interaction between Text-Driven Human Motion Generator and 3D Scenes in a Training-free Manner

Mar 22, 2024

Haoxuan Qu, Ziyan Guo, Jun Liu

Abstract:Recently, while text-driven human motion generation has received massive research attention, most existing text-driven motion generators are generally only designed to generate motion sequences in a blank background. While this is the case, in practice, human beings naturally perform their motions in 3D scenes, rather than in a blank background. Considering this, we here aim to perform scene-aware text-drive motion generation instead. Yet, intuitively training a separate scene-aware motion generator in a supervised way can require a large amount of motion samples to be troublesomely collected and annotated in a large scale of different 3D scenes. To handle this task rather in a relatively convenient manner, in this paper, we propose a novel GPT-connect framework. In GPT-connect, we enable scene-aware motion sequences to be generated directly utilizing the existing blank-background human motion generator, via leveraging ChatGPT to connect the existing motion generator with the 3D scene in a totally training-free manner. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework.

Via

Access Paper or Ask Questions

Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Mar 15, 2024

Hang Zhang, Wenxiao Zhang, Haoxuan Qu, Jun Liu

Figure 1 for Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Figure 2 for Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Figure 3 for Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Figure 4 for Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Abstract:Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems, in which Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding, aimed at comprehensively understanding HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the general reasoning ability like human beings to effectively induce HOI relationships. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that could facilitate the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme to leverage the LLM conduct reasoning from different aspects. In the second stage, we perform Multi-LLMs Debate to get the final reasoning answer based on the different knowledge in different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model to enhance the base V-HOI models' discriminative ability to better cooperate with LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Jan 02, 2024

Li Xu, Haoxuan Qu, Yujun Cai, Jun Liu

Figure 1 for 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Figure 2 for 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Figure 3 for 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Figure 4 for 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Abstract:Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondence, we formulate 2D keypoints detection as a reverse diffusion (denoising) process. To facilitate such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.

* Fix typo

Via

Access Paper or Ask Questions

LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition

Sep 22, 2023

Haoxuan Qu, Xiaofei Hui, Yujun Cai, Jun Liu

Abstract:Open-set object recognition aims to identify if an object is from a class that has been encountered during training or not. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious-discriminative features. In this paper, motivated by that different large models pre-trained through different paradigms can possess very rich while distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge via collaborating different off-the-shelf large models in a training-free manner. Moreover, we also incorporate the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available \href{https://github.com/Harryqu123/LMC}{here}.

* NeurIPS 2023

Via

Access Paper or Ask Questions