Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peihao Zhu

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Jun 10, 2025

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li(+34 more)

Figure 1 for Seedance 1.0: Exploring the Boundaries of Video Generation Models

Figure 2 for Seedance 1.0: Exploring the Boundaries of Video Generation Models

Figure 3 for Seedance 1.0: Exploring the Boundaries of Video Generation Models

Figure 4 for Seedance 1.0: Exploring the Boundaries of Video Generation Models

Abstract:Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with proposed training paradigm, which allows for natively supporting multi-shot generation and jointly learning of both text-to-video and image-to-video tasks. (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation.

* Seedance 1.0 Technical Report

Via

Access Paper or Ask Questions

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Apr 11, 2025

Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang(+44 more)

Figure 1 for Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Figure 2 for Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Figure 3 for Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Figure 4 for Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Abstract:This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continue training. See the project page at https://seaweed.video/

* Technical report

Via

Access Paper or Ask Questions

Learning Feature-Preserving Portrait Editing from Generated Pairs

Jul 29, 2024

Bowei Chen, Tiancheng Zhi, Peihao Zhu, Shen Sang, Jing Liu, Linjie Luo

Abstract:Portrait editing is challenging for existing techniques due to difficulties in preserving subject features like identity. In this paper, we propose a training-based method leveraging auto-generated paired data to learn desired editing while ensuring the preservation of unchanged subject features. Specifically, we design a data generation process to create reasonably good training pairs for desired editing at low cost. Based on these pairs, we introduce a Multi-Conditioned Diffusion Model to effectively learn the editing direction and preserve subject features. During inference, our model produces accurate editing mask that can guide the inference process to further preserve detailed subject features. Experiments on costume editing and cartoon expression editing show that our method achieves state-of-the-art quality, quantitatively and qualitatively.

Via

Access Paper or Ask Questions

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Jan 06, 2023

Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin, Peter Wonka, Sergey Tulyakov

Figure 1 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 2 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 3 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 4 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Abstract:Modern 3D-GANs synthesize geometry and texture by training on large-scale datasets with a consistent structure. Training such models on stylized, artistic data, with often unknown, highly variable geometry, and camera information has not yet been shown possible. Can we train a 3D GAN on such artistic data, while maintaining multi-view consistency and texture quality? To this end, we propose an adaptation framework, where the source domain is a pre-trained 3D-GAN, while the target domain is a 2D-GAN trained on artistic datasets. We then distill the knowledge from a 2D generator to the source 3D generator. To do that, we first propose an optimization-based method to align the distributions of camera parameters across domains. Second, we propose regularizations necessary to learn high-quality texture, while avoiding degenerate geometric solutions, such as flat shapes. Third, we show a deformation-based technique for modeling exaggerated geometry of artistic domains, enabling -- as a byproduct -- personalized geometric editing. Finally, we propose a novel inversion method for 3D-GANs linking the latent spaces of the source and the target domains. Our contributions -- for the first time -- allow for the generation, editing, and animation of personalized artistic 3D avatars on artistic datasets.

* Project Page: https://rameenabdal.github.io/3DAvatarGAN/

Via

Access Paper or Ask Questions

Video2StyleGAN: Disentangling Local and Global Variations in a Video

May 30, 2022

Rameen Abdal, Peihao Zhu, Niloy J. Mitra, Peter Wonka

Figure 1 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 2 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 3 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 4 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Abstract:Image editing using a pretrained StyleGAN generator has emerged as a powerful paradigm for facial editing, providing disentangled controls over age, expression, illumination, etc. However, the approach cannot be directly adopted for video manipulations. We hypothesize that the main missing ingredient is the lack of fine-grained and disentangled control over face location, face pose, and local facial expressions. In this work, we demonstrate that such a fine-grained control is indeed achievable using pretrained StyleGAN by working across multiple (latent) spaces (namely, the positional space, the W+ space, and the S space) and combining the optimization results across the multiple spaces. Building on this enabling component, we introduce Video2StyleGAN that takes a target image and driving video(s) to reenact the local and global locations and expressions from the driving video in the identity of the target image. We evaluate the effectiveness of our method over multiple challenging scenarios and demonstrate clear improvements over alternative approaches.

* Video : https://youtu.be/oUeXFyfdE1A

Via

Access Paper or Ask Questions

CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Dec 09, 2021

Rameen Abdal, Peihao Zhu, John Femiani, Niloy J. Mitra, Peter Wonka

Figure 1 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 2 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 3 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 4 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Abstract:The success of StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images. However, such editing operations are either trained with semantic supervision or described using human guidance. In another development, the CLIP architecture has been trained with internet-scale image and text pairings and has been shown to be useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically labeled edit directions from StyleGAN, finding and naming meaningful edit operations without any additional human guidance. Technically, we propose two novel building blocks; one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels and hence we do not require any additional supervised text/attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible, and reveals interesting and non-trivial edit directions.

Via

Access Paper or Ask Questions

Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Oct 15, 2021

Peihao Zhu, Rameen Abdal, John Femiani, Peter Wonka

Figure 1 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 2 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 3 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 4 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Abstract:We present a new method for one shot domain adaptation. The input to our method is trained GAN that can produce images in domain A and a single reference image I_B from domain B. The proposed algorithm can translate any output of the trained GAN from domain A to domain B. There are two main advantages of our method compared to the current state of the art: First, our solution achieves higher visual quality, e.g. by noticeably reducing overfitting. Second, our solution allows for more degrees of freedom to control the domain gap, i.e. what aspects of image I_B are used to define the domain B. Technically, we realize the new method by building on a pre-trained StyleGAN generator as GAN and a pre-trained CLIP model for representing the domain gap. We propose several new regularizers for controlling the domain gap to optimize the weights of the pre-trained StyleGAN generator to output images in domain B instead of domain A. The regularizers prevent the optimization from taking on too many attributes of the single reference image. Our results show significant visual improvements over the state of the art as well as multiple applications that highlight improved control.

* Video: https://youtu.be/RLBJ-mem9gM

Via

Access Paper or Ask Questions

Flow-Guided Video Inpainting with Scene Templates

Aug 29, 2021

Dong Lao, Peihao Zhu, Peter Wonka, Ganesh Sundaramoorthi

Figure 1 for Flow-Guided Video Inpainting with Scene Templates

Figure 2 for Flow-Guided Video Inpainting with Scene Templates

Figure 3 for Flow-Guided Video Inpainting with Scene Templates

Figure 4 for Flow-Guided Video Inpainting with Scene Templates

Abstract:We consider the problem of filling in missing spatio-temporal regions of a video. We provide a novel flow-based solution by introducing a generative model of images in relation to the scene (without missing regions) and mappings from the scene to images. We use the model to jointly infer the scene template, a 2D representation of the scene, and the mappings. This ensures consistency of the frame-to-frame flows generated to the underlying scene, reducing geometric distortions in flow based inpainting. The template is mapped to the missing regions in the video by a new L2-L1 interpolation scheme, creating crisp inpaintings and reducing common blur and distortion artifacts. We show on two benchmark datasets that our approach out-performs state-of-the-art quantitatively and in user studies.

Via

Access Paper or Ask Questions

Barbershop: GAN-based Image Compositing using Segmentation Masks

Jun 02, 2021

Peihao Zhu, Rameen Abdal, John Femiani, Peter Wonka

Figure 1 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 2 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 3 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 4 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Abstract:Seamlessly blending features from multiple images is extremely challenging because of complex relationships in lighting, geometry, and partial occlusion which cause coupling between different parts of the image. Even though recent work on GANs enables synthesis of realistic hair or faces, it remains difficult to combine them into a single, coherent, and plausible image rather than a disjointed set of image patches. We present a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN-inversion. We propose a novel latent space for image blending which is better at preserving detail and encoding spatial information, and propose a new GAN-embedding algorithm which is able to slightly modify images to conform to a common segmentation mask. Our novel representation enables the transfer of the visual properties from multiple reference images including specific details such as moles and wrinkles, and because we do image blending in a latent-space we are able to synthesize images that are coherent. Our approach avoids blending artifacts present in other approaches and finds a globally consistent image. Our results demonstrate a significant improvement over the current state of the art in a user study, with users preferring our blending solution over 95 percent of the time.

* Project page: https://zpdesu.github.io/Barbershop/ Video: https://youtu.be/ZU-yrAvoJfQ

Via

Access Paper or Ask Questions

Labels4Free: Unsupervised Segmentation using StyleGAN

Mar 27, 2021

Rameen Abdal, Peihao Zhu, Niloy Mitra, Peter Wonka

Figure 1 for Labels4Free: Unsupervised Segmentation using StyleGAN

Figure 2 for Labels4Free: Unsupervised Segmentation using StyleGAN

Figure 3 for Labels4Free: Unsupervised Segmentation using StyleGAN

Figure 4 for Labels4Free: Unsupervised Segmentation using StyleGAN

Abstract:We propose an unsupervised segmentation framework for StyleGAN generated objects. We build on two main observations. First, the features generated by StyleGAN hold valuable information that can be utilized towards training segmentation networks. Second, the foreground and background can often be treated to be largely independent and be composited in different ways. For our solution, we propose to augment the StyleGAN2 generator architecture with a segmentation branch and to split the generator into a foreground and background network. This enables us to generate soft segmentation masks for the foreground object in an unsupervised fashion. On multiple object classes, we report comparable results against state-of-the-art supervised segmentation networks, while against the best unsupervised segmentation approach we demonstrate a clear improvement, both in qualitative and quantitative metrics.

* "Project Page: https://rameenabdal.github.io/Labels4Free/"

Via

Access Paper or Ask Questions