Title: MonoFormer: One Transformer for Both Diffusion and Autoregression

Sep 24, 2024

Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang

Abstract: Most existing multimodal methods either use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or use a single backbone by discretizing the visual data so that autoregression handles both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) the transformer has been successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and for diffusion is very similar; the only difference is that diffusion uses a bidirectional attention mask while autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The project is publicly available at https://monoformer.github.io/.
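The mask difference described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`causal_mask`, `bidirectional_mask`, `mixed_mask`) are hypothetical, and the sequence layout (text tokens followed by one block of image tokens, with the image tokens allowed to see all preceding text) is an assumption made only for illustration. `True` at `mask[i][j]` means position `i` may attend to position `j`.

```python
def causal_mask(n):
    """Autoregressive mask: position i attends only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion-style mask: every position attends to every position."""
    return [[True] * n for _ in range(n)]


def mixed_mask(num_text, num_image):
    """Hypothetical combined mask for a sequence of text tokens followed by
    image tokens (layout is an assumption, not taken from the paper).

    Text positions stay causal; image positions additionally attend to
    each other in both directions.
    """
    n = num_text + num_image
    mask = causal_mask(n)
    # Open up the image block so image tokens see each other bidirectionally.
    for i in range(num_text, n):
        for j in range(num_text, n):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print a small example: 2 text tokens followed by 2 image tokens.
    for row in mixed_mask(2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions the same transformer weights can serve both tasks, with only the mask passed to attention changing between the text (causal) and image (bidirectional) portions of the sequence.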
