Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Weichen Fan

**CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models**

Mar 24, 2025

Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu

Abstract:Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)

Via

Access Paper or Ask Questions

RepVideo: Rethinking Cross-Layer Representation for Video Generation

Jan 15, 2025

Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu

Figure 1 for RepVideo: Rethinking Cross-Layer Representation for Video Generation

Figure 2 for RepVideo: Rethinking Cross-Layer Representation for Video Generation

Figure 3 for RepVideo: Rethinking Cross-Layer Representation for Video Generation

Figure 4 for RepVideo: Rethinking Cross-Layer Representation for Video Generation

Abstract:Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.

* Project page: https://vchitect.github.io/RepVid-Webpage

Via

Access Paper or Ask Questions

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Jan 14, 2025

Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan(+9 more)

Figure 1 for Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Figure 2 for Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Figure 3 for Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Figure 4 for Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Abstract:We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.

Via

Access Paper or Ask Questions

Link-Context Learning for Multimodal LLMs

Aug 15, 2023

Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu

Abstract:The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to ``learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs. Code and data will be released at https://github.com/isekai-portal/Link-Context-Learning.

* 10 pages, 8 figures

Via

Access Paper or Ask Questions

Hierarchy Flow For High-Fidelity Image-to-Image Translation

Aug 14, 2023

Weichen Fan, Jinghuan Chen, Ziwei Liu

Figure 1 for Hierarchy Flow For High-Fidelity Image-to-Image Translation

Figure 2 for Hierarchy Flow For High-Fidelity Image-to-Image Translation

Figure 3 for Hierarchy Flow For High-Fidelity Image-to-Image Translation

Figure 4 for Hierarchy Flow For High-Fidelity Image-to-Image Translation

Abstract:Image-to-image (I2I) translation comprises a wide spectrum of tasks. Here we divide this problem into three levels: strong-fidelity translation, normal-fidelity translation, and weak-fidelity translation, indicating the extent to which the content of the original image is preserved. Although existing methods achieve good performance in weak-fidelity translation, they fail to fully preserve the content in both strong- and normal-fidelity tasks, e.g. sim2real, style transfer and low-level vision. In this work, we propose Hierarchy Flow, a novel flow-based model to achieve better content preservation during translation. Specifically, 1) we first unveil the drawbacks of standard flow-based models when applied to I2I translation. 2) Next, we propose a new design, namely hierarchical coupling for reversible feature transformation and multi-scale modeling, to constitute Hierarchy Flow. 3) Finally, we present a dedicated aligned-style loss for a better trade-off between content preservation and stylization during translation. Extensive experiments on a wide range of I2I translation benchmarks demonstrate that our approach achieves state-of-the-art performance, with convincing advantages in both strong- and normal-fidelity tasks. Code and models will be at https://github.com/WeichenFan/HierarchyFlow.

* arXiv admin note: text overlap with arXiv:2207.01909

Via

Access Paper or Ask Questions

StyleFlow For Content-Fixed Image to Image Translation

Jul 05, 2022

Weichen Fan, Jinghuan Chen, Jiabin Ma, Jun Hou, Shuai Yi

Figure 1 for StyleFlow For Content-Fixed Image to Image Translation

Figure 2 for StyleFlow For Content-Fixed Image to Image Translation

Figure 3 for StyleFlow For Content-Fixed Image to Image Translation

Figure 4 for StyleFlow For Content-Fixed Image to Image Translation

Abstract:Image-to-image (I2I) translation is a challenging topic in computer vision. We divide this problem into three tasks: strongly constrained translation, normally constrained translation, and weakly constrained translation. The constraint here indicates the extent to which the content or semantic information in the original image is preserved. Although previous approaches have achieved good performance in weakly constrained tasks, they failed to fully preserve the content in both strongly and normally constrained tasks, including photo-realism synthesis, style transfer, and colorization, etc. To achieve content-preserving transfer in strongly constrained and normally constrained tasks, we propose StyleFlow, a new I2I translation model that consists of normalizing flows and a novel Style-Aware Normalization (SAN) module. With the invertible network structure, StyleFlow first projects input images into deep feature space in the forward pass, while the backward pass utilizes the SAN module to perform content-fixed feature transformation and then projects back to image space. Our model supports both image-guided translation and multi-modal synthesis. We evaluate our model in several I2I translation benchmarks, and the results show that the proposed model has advantages over previous methods in both strongly constrained and normally constrained tasks.

Via

Access Paper or Ask Questions

InvNorm: Domain Generalization for Object Detection in Gastrointestinal Endoscopy

May 05, 2022

Weichen Fan, Yuanbo Yang, Kunpeng Qiu, Shuo Wang, Yongxin Guo

Figure 1 for InvNorm: Domain Generalization for Object Detection in Gastrointestinal Endoscopy

Figure 2 for InvNorm: Domain Generalization for Object Detection in Gastrointestinal Endoscopy

Figure 3 for InvNorm: Domain Generalization for Object Detection in Gastrointestinal Endoscopy

Figure 4 for InvNorm: Domain Generalization for Object Detection in Gastrointestinal Endoscopy

Abstract:Domain Generalization is a challenging topic in computer vision, especially in Gastrointestinal Endoscopy image analysis. Due to several device limitations and ethical reasons, current open-source datasets are typically collected on a limited number of patients using the same brand of sensors. Different brands of devices and individual differences will significantly affect the model's generalizability. Therefore, to address the generalization problem in GI(Gastrointestinal) endoscopy, we propose a multi-domain GI dataset and a light, plug-in block called InvNorm(Invertible Normalization), which could achieve a better generalization performance in any structure. Previous DG(Domain Generalization) methods fail to achieve invertible transformation, which would lead to some misleading augmentation. Moreover, these models would be more likely to lead to medical ethics issues. Our method utilizes normalizing flow to achieve invertible and explainable style normalization to address the problem. The effectiveness of InvNorm is demonstrated on a wide range of tasks, including GI recognition, GI object detection, and natural image recognition.

Via

Access Paper or Ask Questions