Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wanquan Feng

Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Mar 25, 2025

Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang

Abstract:Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Mar 10, 2025

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, Jing Liu

Abstract:The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Dec 05, 2024

Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, Qian He

Figure 1 for AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Figure 2 for AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Figure 3 for AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Figure 4 for AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Abstract:Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-craft design, AnyDressing can serve as a plug-in module to easily integrate with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.

* Project page: https://crayon-shinchan.github.io/AnyDressing/

Via

Access Paper or Ask Questions

I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Nov 26, 2024

Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, Qian He

Figure 1 for I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Figure 2 for I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Figure 3 for I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Figure 4 for I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Abstract:Video generation technologies are developing rapidly and have broad potential applications. Among these technologies, camera control is crucial for generating professional-quality videos that accurately meet user expectations. However, existing camera control methods still suffer from several limitations, including control precision and the neglect of the control for subject motion dynamics. In this work, we propose I2VControl-Camera, a novel camera control method that significantly enhances controllability while providing adjustability over the strength of subject motion. To improve control precision, we employ point trajectory in the camera coordinate system instead of only extrinsic matrix information as our control signal. To accurately control and adjust the strength of subject motion, we explicitly model the higher-order components of the video trajectory expansion, not merely the linear terms, and design an operator that effectively represents the motion strength. We use an adapter architecture that is independent of the base model structure. Experiments on static and dynamic scenes show that our framework outperformances previous methods both quantitatively and qualitatively. The project page is: https://wanquanf.github.io/I2VControlCamera .

* Project page: https://wanquanf.github.io/I2VControlCamera

Via

Access Paper or Ask Questions

I2VControl: Disentangled and Unified Video Motion Synthesis Control

Nov 26, 2024

Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He

Figure 1 for I2VControl: Disentangled and Unified Video Motion Synthesis Control

Figure 2 for I2VControl: Disentangled and Unified Video Motion Synthesis Control

Figure 3 for I2VControl: Disentangled and Unified Video Motion Synthesis Control

Figure 4 for I2VControl: Disentangled and Unified Video Motion Synthesis Control

Abstract:Video synthesis techniques are undergoing rapid progress, with controllability being a significant aspect of practical usability for end-users. Although text condition is an effective way to guide video synthesis, capturing the correct joint distribution between text descriptions and video motion remains a substantial challenge. In this paper, we present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis. Our approach partitions the video into individual motion units and represents each unit with disentangled control signals, which allows for various control types to be flexibly combined within our single system. Furthermore, our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. The project page is: https://wanquanf.github.io/I2VControl .

* Project page: https://wanquanf.github.io/I2VControl

Via

Access Paper or Ask Questions

DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Oct 28, 2024

Xun Guo, Shan Zhang, Yongxin He, Ting Zhang, Wanquan Feng, Haibin Huang, Chongyang Ma

Figure 1 for DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Figure 2 for DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Figure 3 for DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Figure 4 for DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Abstract:Current techniques for detecting AI-generated text are largely confined to manual feature crafting and supervised binary classification paradigms. These methodologies typically lead to performance bottlenecks and unsatisfactory generalizability. Consequently, these methods are often inapplicable for out-of-distribution (OOD) data and newly emerged large language models (LLMs). In this paper, we revisit the task of AI-generated text detection. We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors, rather than simply classifying the text into human-written or AI-generated text. To this end, we propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework. DeTeCtive is designed to facilitate the learning of distinct writing styles, combined with a dense information retrieval pipeline for AI-generated text detection. Our method is compatible with a range of text encoders. Extensive experiments demonstrate that our method enhances the ability of various text encoders in detecting AI-generated text across multiple benchmarks and achieves state-of-the-art results. Notably, in OOD zero-shot evaluation, our method outperforms existing approaches by a large margin. Moreover, we find our method boasts a Training-Free Incremental Adaptation (TFIA) capability towards OOD data, further enhancing its efficacy in OOD detection scenarios. We will open-source our code and models in hopes that our work will spark new thoughts in the field of AI-generated text detection, ensuring safe application of LLMs and enhancing compliance. Our code is available at https://github.com/heyongxin233/DeTeCtive.

* To appear in NeurIPS 2024. Code is available at https://github.com/heyongxin233/DeTeCtive

Via

Access Paper or Ask Questions

StyleBrush: Style Extraction and Transfer from a Single Image

Aug 18, 2024

Wancheng Feng, Wanquan Feng, Dawei Huang, Jiaming Pei, Guangliang Cheng, Lukun Wang

Abstract:Stylization for visual content aims to add specific style patterns at the pixel level while preserving the original structural features. Compared with using predefined styles, stylization guided by reference style images is more challenging, where the main difficulty is to effectively separate style from structural elements. In this paper, we propose StyleBrush, a method that accurately captures styles from a reference image and ``brushes'' the extracted style onto other input visual content. Specifically, our architecture consists of two branches: ReferenceNet, which extracts style from the reference image, and Structure Guider, which extracts structural features from the input image, thus enabling image-guided stylization. We utilize LLM and T2I models to create a dataset comprising 100K high-quality style images, encompassing a diverse range of styles and contents with high aesthetic score. To construct training pairs, we crop different regions of the same training image. Experiments show that our approach achieves state-of-the-art results through both qualitative and quantitative analyses. We will release our code and dataset upon acceptance of the paper.

* 9 pages, 6figures, Under Review

Via

Access Paper or Ask Questions

Towards Practical Capture of High-Fidelity Relightable Avatars

Sep 08, 2023

Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, Chongyang Ma

Figure 1 for Towards Practical Capture of High-Fidelity Relightable Avatars

Figure 2 for Towards Practical Capture of High-Fidelity Relightable Avatars

Figure 3 for Towards Practical Capture of High-Fidelity Relightable Avatars

Figure 4 for Towards Practical Capture of High-Fidelity Relightable Avatars

Abstract:In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and ensures the satisfaction of the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real-time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.

* Accepted to SIGGRAPH Asia 2023 (Conference); Project page: https://travatar-paper.github.io/

Via

Access Paper or Ask Questions

Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Jun 30, 2022

Hongrui Cai, Wanquan Feng, Xuetao Feng, Yan Wang, Juyong Zhang

Figure 1 for Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Figure 2 for Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Figure 3 for Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Figure 4 for Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Abstract:We propose Neural-DynamicReconstruction (NDR), a template-free method to recover high-fidelity geometry and motions of a dynamic scene from a monocular RGB-D camera. In NDR, we adopt the neural implicit function for surface representation and rendering such that the captured color and depth can be fully utilized to jointly optimize the surface and deformations. To represent and constrain the non-rigid deformations, we propose a novel neural invertible deforming network such that the cycle consistency between arbitrary two frames is automatically satisfied. Considering that the surface topology of dynamic scene might change over time, we employ a topology-aware strategy to construct the topology-variant correspondence for the fused frames. NDR also further refines the camera poses in a global optimization manner. Experiments on public datasets and our collected dataset demonstrate that NDR outperforms existing monocular dynamic reconstruction methods.

* Project page: https://ustc3dv.github.io/ndr

Via

Access Paper or Ask Questions

Neural Points: Point Cloud Representation with Neural Fields

Dec 13, 2021

Wanquan Feng, Jin Li, Hongrui Cai, Xiaonan Luo, Juyong Zhang

Figure 1 for Neural Points: Point Cloud Representation with Neural Fields

Figure 2 for Neural Points: Point Cloud Representation with Neural Fields

Figure 3 for Neural Points: Point Cloud Representation with Neural Fields

Figure 4 for Neural Points: Point Cloud Representation with Neural Fields

Abstract:In this paper, we propose \emph{Neural Points}, a novel point cloud representation. Unlike traditional point cloud representation where each point only represents a position or a local plane in the 3D space, each point in Neural Points represents a local continuous geometric shape via neural fields. Therefore, Neural Points can express much more complex details and thus have a stronger representation ability. Neural Points is trained with high-resolution surface containing rich geometric details, such that the trained model has enough expression ability for various shapes. Specifically, we extract deep local features on the points and construct neural fields through the local isomorphism between the 2D parametric domain and the 3D local patch. In the final, local neural fields are integrated together to form the global surface. Experimental results show that Neural Points has powerful representation ability and demonstrate excellent robustness and generalization ability. With Neural Points, we can resample point cloud with arbitrary resolutions, and it outperforms state-of-the-art point cloud upsampling methods by a large margin.

* Project page: https://wanquanf.github.io/NeuralPoints.html

Via

Access Paper or Ask Questions