Title: Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Oct 06, 2024

Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su

Abstract: Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual text. Existing backbone models have limitations such as misspelling, failing to generate text, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods that aim to empower backbone models to generate visual text in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual text. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.
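The abstract does not spell out how the mixed granularity input strategy is implemented. As a rough illustration of the general idea only, the sketch below assumes that the text to be rendered is marked with double quotes in the prompt and is split into individual characters, while the rest of the prompt is split into coarser units (plain whitespace-separated words here, standing in for BPE subwords). The function name `mixed_granularity_tokens` and the quoting convention are hypothetical and are not taken from the paper.

```python
import re
from typing import List


def mixed_granularity_tokens(prompt: str) -> List[str]:
    """Illustrative only: character-level tokens for quoted spans,
    word-level tokens for everything else.

    Assumptions (not from the paper): the text to render is wrapped in
    double quotes, and whitespace splitting stands in for BPE.
    """
    tokens: List[str] = []
    # The capturing group makes re.split keep the quoted spans in the result.
    for segment in re.split(r'("[^"]*")', prompt):
        is_quoted = (
            len(segment) >= 2
            and segment.startswith('"')
            and segment.endswith('"')
        )
        if is_quoted:
            # Fine granularity: one token per character of the visual text.
            tokens.extend(list(segment[1:-1]))
        else:
            # Coarse granularity: whitespace-separated words.
            tokens.extend(segment.split())
    return tokens


if __name__ == "__main__":
    print(mixed_granularity_tokens('a sign that says "Open"'))
    print(mixed_granularity_tokens('a poster with the words "你好"'))
```

Splitting the rendered text into characters gives every glyph its own input position, which is one plausible way to avoid a subword tokenizer merging several letters into a single token; the paper's actual design may differ.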
