Zeyu Han

Absolute Coordinates Make Motion Generation Easy

May 26, 2025

Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, Huaizu Jiang

Abstract: State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.

* Preprint
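
To make the contrast between the two representations concrete, here is a minimal sketch under a heavily simplified assumption: the local-relative form is taken to be a per-frame pelvis displacement plus joint offsets from the pelvis. The actual HumanML3D features also include rotations, velocities, and foot contacts, which are omitted here. The function names, array shapes, and the choice of joint 0 as the pelvis are illustrative assumptions, not the paper's code.

```python
import numpy as np

def to_local_relative(abs_joints):
    """Convert absolute joints of shape (T, J, 3) to a simplified relative form.

    Assumption: joint 0 is the pelvis (root). Rotations are ignored.
    """
    root = abs_joints[:, 0, :]                            # (T, 3) pelvis path
    # Frame-to-frame pelvis displacement; the first row is zero.
    root_delta = np.diff(root, axis=0, prepend=root[:1])  # (T, 3)
    # Every joint expressed relative to the pelvis of the same frame.
    offsets = abs_joints - root[:, None, :]               # (T, J, 3)
    return root[0], root_delta, offsets

def to_absolute(root_start, root_delta, offsets):
    """Invert to_local_relative by accumulating pelvis displacements."""
    root = root_start + np.cumsum(root_delta, axis=0)     # (T, 3)
    return offsets + root[:, None, :]

# Round trip on random motion: 5 frames, 4 joints.
motion = np.random.default_rng(0).normal(size=(5, 4, 3))
start, delta, offsets = to_local_relative(motion)
recovered = to_absolute(start, delta, offsets)
```

In the relative form, the global position at frame t depends on a running sum over all earlier frames, whereas absolute coordinates expose each joint's position directly. Under this sketch's assumptions, that is why a per-frame spatial constraint is simpler to state in absolute coordinates.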

Rethinking Diffusion for Text-Driven Human Motion Generation

Nov 25, 2024

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, Huaizu Jiang

Abstract: Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly come to dominate human motion generation, surpassing diffusion-based continuous generation methods on standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space generation of diffusion-based methods makes them well suited to address these limitations, with added potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model that performs bidirectional masked autoregression, optimized with a reformed data representation and distribution. We also propose more robust evaluation methods to fairly compare VQ-based and diffusion-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art performance.

* Preprint

DenserRadar: A 4D millimeter-wave radar point cloud detector based on dense LiDAR point clouds

May 08, 2024

Zeyu Han, Junkai Jiang, Xiaokang Ding, Qingwen Meng, Shaobing Xu, Lei He, Jianqiang Wang

Abstract: The 4D millimeter-wave (mmWave) radar, with its robustness in extreme environments, extensive detection range, and capabilities for measuring velocity and elevation, has demonstrated significant potential for enhancing the perception abilities of autonomous driving systems in corner-case scenarios. Nevertheless, the inherent sparsity and noise of 4D mmWave radar point clouds restrict its further development and practical application. In this paper, we introduce a novel 4D mmWave radar point cloud detector, which leverages high-resolution dense LiDAR point clouds. Our approach constructs dense 3D occupancy ground truth from stitched LiDAR point clouds, and employs a specially designed network named DenserRadar. The proposed method surpasses existing probability-based and learning-based radar point cloud detectors in terms of both point cloud density and accuracy on the K-Radar dataset.

Neural Radiance Field in Autonomous Driving: A Survey

Apr 26, 2024

Lei He, Leheng Li, Wenchao Sun, Zeyu Han, Yichen Liu, Sifa Zheng, Jianqiang Wang, Keqiang Li

Abstract: Neural Radiance Field (NeRF) has garnered significant attention from both academia and industry due to its intrinsic advantages, particularly its implicit representation and novel view synthesis capabilities. With the rapid advancements in deep learning, a multitude of methods have emerged to explore the potential applications of NeRF in the domain of Autonomous Driving (AD). However, a conspicuous void is apparent within the current literature. To bridge this gap, this paper conducts a comprehensive survey of NeRF's applications in the context of AD. Our survey categorizes NeRF's applications in AD into perception, 3D reconstruction, simultaneous localization and mapping (SLAM), and simulation. We provide an in-depth analysis and summarize the findings for each application category, and conclude by providing insights and discussions on future directions in this field. We hope this paper serves as a comprehensive reference for researchers in this domain. To the best of our knowledge, this is the first survey specifically focused on the applications of NeRF in the Autonomous Driving domain.

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Apr 16, 2024

Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou (+122 more)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of ×4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

* The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024
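
The PSNR thresholds quoted in the abstract follow the standard definition, PSNR = 10 · log10(MAX² / MSE). The sketch below computes it with NumPy for illustration. It assumes 8-bit images (peak value 255) and does not reproduce the challenge's exact evaluation protocol, such as any border cropping or colour-channel handling, which is not stated here. The function name `psnr` is a hypothetical helper.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)  # mean squared error
    if mse == 0:
        return float("inf")                     # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Two tiny images that differ by exactly one grey level everywhere (MSE = 1).
high_res = np.zeros((4, 4))
restored = np.ones((4, 4))
value_db = psnr(high_res, restored)
```

With an MSE of 1 the result equals 20 · log10(255), about 48.13 dB. A target near 27 dB, as in the challenge, corresponds to a much larger reconstruction error.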

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Mar 21, 2024

Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang

Abstract: Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, especially on hardware platforms constrained by computational capabilities. Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. Specifically, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

* 25 pages, 13 figures
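
As an illustration of "minimizing the number of additional parameters", the sketch below uses low-rank adaptation, one well-known PEFT family. The abstract does not single it out, so treat it as an example chosen here rather than the survey's method. The class name, initialization scale, and layer sizes are hypothetical, and no training loop is shown.

```python
import numpy as np

class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update B @ A (illustrative)."""

    def __init__(self, weight, rank, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero init

    def forward(self, x):
        # Base projection plus the low-rank correction; with B = 0 the
        # adapter leaves the pre-trained layer's output unchanged.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        # Trainable parameters relative to the frozen weight matrix.
        return (self.A.size + self.B.size) / self.weight.size

frozen = np.ones((64, 64))
adapter = LowRankAdapter(frozen, rank=4)
batch = np.ones((2, 64))
output = adapter.forward(batch)
fraction = adapter.trainable_fraction()
```

For this 64×64 layer with rank 4, the adapter trains 512 values against 4,096 frozen ones, a fraction of 0.125. The ratio shrinks further as the layer grows while the rank stays fixed.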

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Nov 28, 2023

Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang

Abstract: Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to the provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate that our method improves visual grounding performance by up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model.

Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction

Aug 20, 2023

Jie Zeng, Zeyu Han, Xingchen Peng, Jianghong Xiao, Peng Wang, Yan Wang

Abstract: Recently, deep learning (DL) has significantly automated and accelerated clinical radiation therapy (RT) planning by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable to cervical cancer, where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from another cancer, i.e., rectal cancer, which has the same scanning area and more clinically available data, to improve the dose map prediction performance for cervical cancer through domain adaptation. In order to close the inherent domain gap between the source (i.e., rectal cancer) and the target (i.e., cervical cancer) domains, we develop an effective Transformer-based polymerized feature module (PFM), which can generate an optimal polymerized feature distribution to smoothly align the two input distributions. Experimental results on two in-house clinical datasets demonstrate the superiority of the proposed method compared with state-of-the-art methods.

* Accepted and presented at ISBI 2023. To be published in Proceedings

Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

Aug 20, 2023

Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen

Abstract: To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternative due to their improved sample quality and higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from two major drawbacks in real clinical settings, i.e., the computationally expensive sampling process and the insufficient preservation of correspondence between the conditioning LPET image and the reconstructed PET (RPET) image. To address the above limitations, this paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, and the IRM samples the residual iteratively. By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved. Furthermore, two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process, which can enhance the correspondence between the LPET image and the RPET image, further improving clinical reliability. Extensive experiments on two human brain PET datasets demonstrate that our method outperforms the state-of-the-art PET reconstruction methods. The source code is available at https://github.com/Show-han/PET-Reconstruction.

* Accepted and presented at MICCAI 2023. To be published in Proceedings

Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Aug 08, 2023

Binfeng Xu, Xukun Liu, Hua Shen, Zeyu Han, Yuhan Li, Murong Yue, Zhiyuan Peng, Yuchen Liu, Ziyu Yao, Dongkuan Xu

Abstract: Augmented Language Models (ALMs) empower large language models with the ability to use tools, transforming them into intelligent agents for real-world interactions. However, most existing frameworks for ALMs, to varying degrees, are deficient in the following critical features: flexible customization, collaborative democratization, and holistic evaluation. We present gentopia, an ALM framework enabling flexible customization of agents through simple configurations, seamlessly integrating various language models, task formats, prompting modules, and plugins into a unified paradigm. Furthermore, we establish gentpool, a public platform enabling the registration and sharing of user-customized agents. Agents registered in gentpool are composable such that they can be assembled together for agent collaboration, advancing the democratization of artificial intelligence. To ensure high-quality agents, gentbench, an integral component of gentpool, is designed to thoroughly evaluate user-customized agents across diverse aspects such as safety, robustness, and efficiency. We release gentopia on GitHub and will continue to develop it.
