Abstract: In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at \url{https://github.com/scutzzj/AniPortrait}.
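To make the two-stage design concrete, here is a minimal structural sketch in PyTorch. All module names, shapes, and layer choices (a GRU for stage 1, a small conv net for stage 2) are illustrative assumptions, not the authors' implementation; stage 2 stands in for a single denoising step of a landmark-conditioned diffusion model and omits the temporal motion module.

```python
import torch
import torch.nn as nn

class Audio2Landmark(nn.Module):
    """Stage 1 sketch: audio features -> 2D landmark sequence.
    (The paper goes through 3D intermediate representations first,
    which this simplification skips.)"""
    def __init__(self, audio_dim=128, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, n_landmarks * 2)

    def forward(self, audio):                      # (B, T, audio_dim)
        h, _ = self.rnn(audio)
        out = self.head(h)                         # (B, T, 2 * n_landmarks)
        return out.view(out.size(0), out.size(1), -1, 2)

class LandmarkConditionedDenoiser(nn.Module):
    """Stage 2 stand-in: predicts noise for one frame, conditioned on
    landmarks; a real model would be a diffusion U-Net with a motion module."""
    def __init__(self, n_landmarks=68, ch=64):
        super().__init__()
        self.cond = nn.Linear(n_landmarks * 2, ch)
        self.net = nn.Sequential(
            nn.Conv2d(3 + ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, noisy_frame, landmarks):     # (B,3,H,W), (B,L,2)
        c = self.cond(landmarks.flatten(1))        # (B, ch)
        c = c[:, :, None, None].expand(-1, -1, *noisy_frame.shape[-2:])
        return self.net(torch.cat([noisy_frame, c], dim=1))  # predicted noise

# Shape check with dummy data.
audio = torch.randn(1, 16, 128)
lmks = Audio2Landmark()(audio)                     # (1, 16, 68, 2)
eps = LandmarkConditionedDenoiser()(torch.randn(1, 3, 64, 64), lmks[:, 0])
print(lmks.shape, eps.shape)
```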
Abstract: Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of the hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm. In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the InterHand2.6M dataset demonstrate that our proposed VA-NeRF significantly outperforms conventional NeRFs. Project page: \url{https://github.com/XuanHuang0/VANeRF}.
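The visibility-guided fusion can be illustrated with a small sketch: per-point features from the two hands are merged with weights derived from predicted visibility, so that occluded query points can borrow features from the visible side. The module below, including its name, the visibility head, and all shapes, is a hypothetical simplification of the paper's feature fusion module, not its actual architecture.

```python
import torch
import torch.nn as nn

class VisibilityAwareFusion(nn.Module):
    """Merge per-point features of the left and right hands, weighted by
    predicted visibility, so occluded queries rely on the visible hand."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # Predicts a visibility logit per query point from its feature.
        self.vis_head = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, feat_left, feat_right):      # (N, feat_dim) each
        v_l = self.vis_head(feat_left)             # (N, 1) visibility logits
        v_r = self.vis_head(feat_right)
        w = torch.softmax(torch.cat([v_l, v_r], dim=-1), dim=-1)  # (N, 2)
        fused = w[:, :1] * feat_left + w[:, 1:] * feat_right
        return fused, w

# Shape check with dummy per-point features.
f_l, f_r = torch.randn(1024, 64), torch.randn(1024, 64)
fused, weights = VisibilityAwareFusion()(f_l, f_r)
print(fused.shape, weights.shape)  # (1024, 64), (1024, 2)
```

The pixel-wise visibility map produced by the discriminator plays an analogous role on the image side: instead of one real/fake score, each pixel receives its own supervision signal.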
Abstract: Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, it is difficult for current models to generate meshes that align well with the image. To tackle this issue, we introduce a dual noise estimation method in this paper. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain coarse hand meshes. We assume that the mesh vertices and their image-plane projections are noisy and can be associated in a unified probabilistic model. We then learn the distributions of the noise to refine the mesh vertices and their projections. The refined vertices are further utilized to refine the camera parameters in a closed-form manner. Consequently, our method obtains well-aligned and high-quality 3D hand meshes. Extensive experiments on the large-scale InterHand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10$\%$ but also achieves state-of-the-art performance. Project page: \url{https://github.com/hanhuili/DNE4Hand}.
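A closed-form camera refinement of this kind plausibly amounts to a least-squares fit of a camera model to the refined vertices and their refined projections. The sketch below shows one standard such formulation under a weak-perspective assumption (scale plus 2D translation); it is a generic illustration, and the paper's exact update may differ.

```python
import numpy as np

def fit_weak_perspective(verts3d, proj2d):
    """Closed-form least-squares fit of a weak-perspective camera
    (scale s, translation t) so that s * verts3d[:, :2] + t ≈ proj2d."""
    v = verts3d[:, :2]
    vc = v - v.mean(axis=0)                 # centered 3D (x, y) coords
    pc = proj2d - proj2d.mean(axis=0)       # centered 2D projections
    s = (vc * pc).sum() / (vc * vc).sum()   # optimal scale
    t = proj2d.mean(axis=0) - s * v.mean(axis=0)  # optimal translation
    return s, t

# Sanity check on synthetic data (778 vertices, as in a MANO-style mesh).
rng = np.random.default_rng(0)
V = rng.normal(size=(778, 3))
s_true, t_true = 2.5, np.array([10.0, -4.0])
P = s_true * V[:, :2] + t_true
s, t = fit_weak_perspective(V, P)
print(round(s, 3), t.round(3))              # 2.5 [10. -4.]
```

Because both the scale and the translation have closed-form solutions (set the gradient of the squared reprojection error to zero), this step needs no iterative optimization.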
Abstract: Dancing to music has long been an essential human art form for expressing emotion. Due to its high temporal-spatial complexity, long-term realistic 3D dance generation synchronized with music is challenging. Existing methods suffer from a freezing problem when generating long-term dances, due to error accumulation and the training-inference discrepancy. To address this, we design a conditional diffusion model, LongDanceDiff, for sequence-to-sequence long-term dance generation, addressing the challenges of temporal coherency and spatial constraints. LongDanceDiff contains a transformer-based diffusion model, where the input is a concatenation of music, past motions, and noised future motions. This partial noising strategy leverages the full-attention mechanism and learns the dependencies among music and past motions. To enhance the diversity of generated dance motions and mitigate the freezing problem, we introduce a mutual information minimization objective that regularizes the dependency between past and future motions. We also address common visual quality issues in dance generation, such as foot sliding and unsmooth motion, by incorporating spatial constraints through a Global-Trajectory Modulation (GTM) layer and motion perceptual losses, thereby improving the smoothness and naturalness of motion generation. Extensive experiments demonstrate a significant improvement of our approach over existing state-of-the-art methods. We plan to release our codes and models soon.
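The partial noising strategy can be sketched directly from the description: music and past-motion tokens are left clean while only the future-motion tokens receive diffusion noise, and the concatenation is fed to the transformer. Token counts, feature dimensions, and the linear DDPM schedule below are illustrative assumptions, not the paper's settings.

```python
import torch

def partially_noised_input(music, past, future, t, alphas_cumprod):
    """Build the transformer input for a partial-noising strategy:
    music and past-motion tokens stay clean; only future-motion tokens
    receive forward-diffusion noise at step t."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)             # (B, 1, 1)
    noise = torch.randn_like(future)
    noised_future = a_bar.sqrt() * future + (1 - a_bar).sqrt() * noise
    x = torch.cat([music, past, noised_future], dim=1)   # (B, Tm+Tp+Tf, D)
    return x, noise                                      # noise = target

# Dummy shapes: 32 music tokens, 16 past frames, 16 future frames, dim 64.
B, D = 2, 64
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x, eps = partially_noised_input(
    torch.randn(B, 32, D), torch.randn(B, 16, D), torch.randn(B, 16, D),
    torch.randint(0, 1000, (B,)), alphas_cumprod)
print(x.shape)  # torch.Size([2, 64, 64])
```

With this construction, full attention over the concatenated sequence lets the denoiser condition the noisy future tokens on the clean music and past-motion context, which is what allows the model to learn their dependencies.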