Abstract: We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build a complete data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the image captions. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Under our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT
Abstract: Image style transfer is an underdetermined problem, where a large number of solutions can satisfy the same constraint (i.e., the content and style). Most current methods produce visually identical outputs and therefore lack diversity. Recently, some methods have introduced a diversity loss to train feed-forward networks for diverse outputs, but they still suffer from many issues. In this paper, we propose a simple yet effective method for diversified style transfer. Our method produces diverse outputs for arbitrary styles by combining the whitening and coloring transforms (WCT) with a novel deep feature perturbation (DFP) operation, which uses an orthogonal random noise matrix to perturb the deep image features while keeping the original style information unchanged. In addition, our method is learning-free and can be easily integrated into many existing WCT-based methods, enabling them to generate diverse results. Experimental results demonstrate that our method greatly increases diversity while maintaining the quality of stylization. Several new user studies also show that users obtain more satisfactory results with the diversified approaches based on our method.
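As a rough illustration of why an orthogonal perturbation can leave the style statistics untouched, the following NumPy sketch applies WCT to flattened feature maps and optionally inserts a random orthogonal matrix Z into the coloring step. The exact placement of Z, the use of the feature covariance as the "style information", and all function names here are assumptions made for illustration; the abstract does not specify them, and real use would operate on encoder features (e.g. from a pretrained CNN) rather than random arrays.

```python
import numpy as np


def _centered_eig(features, eps=1e-5):
    """Return mean, centered features, and eigen-decomposition of the covariance.

    `features` has shape (channels, positions).
    """
    mean = features.mean(axis=1, keepdims=True)
    centered = features - mean
    cov = centered @ centered.T / (features.shape[1] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Clip tiny eigenvalues so the inverse square root stays finite.
    return mean, centered, np.clip(eigvals, eps, None), eigvecs


def random_orthogonal(n, rng):
    """Draw a random n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


def wct_with_perturbation(content, style, rng=None):
    """WCT on (channels, positions) features, optionally perturbed.

    With rng=None this is plain WCT (Z is the identity). With an rng, a random
    orthogonal Z is placed between the style eigenvectors and eigenvalues
    (assumed placement), which changes the output but not its covariance.
    """
    _, c_centered, c_vals, c_vecs = _centered_eig(content)
    s_mean, _, s_vals, s_vecs = _centered_eig(style)

    # Whitening: remove the content covariance (result has identity covariance).
    whitened = c_vecs @ np.diag(c_vals ** -0.5) @ c_vecs.T @ c_centered

    channels = content.shape[0]
    z = np.eye(channels) if rng is None else random_orthogonal(channels, rng)

    # Coloring: impose the style covariance. Since Z @ Z.T = I, the output
    # covariance is E_s D_s E_s^T for any orthogonal Z.
    colored = s_vecs @ np.diag(s_vals ** 0.5) @ z @ s_vecs.T @ whitened
    return colored + s_mean
```

Calling `wct_with_perturbation(content, style, np.random.default_rng(seed))` with different seeds yields different outputs whose channel covariance matches that of `style` in each case, which is the property the abstract describes; because nothing is trained, the same step could in principle be dropped into other WCT-style pipelines.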