Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Nov 15, 2024

Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie

Figure 1 for M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Figure 2 for M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Figure 3 for M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Figure 4 for M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Share this with someone who'll enjoy it:

Abstract:There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from the vanilla next-token prediction, VAR structurally reformulates the image generation into a coarse to fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into \textit{intra-scale modeling}, which captures local spatial dependencies within each scale, and \textit{inter-scale modeling}, which models cross-scale relationships progressively from coarse-to-fine scales. This decoupling structure allows to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling -- crucial for generating high-fidelity images -- we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256$\times$256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is avaiable at \url{https://github.com/OliverRensu/MVAR}.

View paper on

Share this with someone who'll enjoy it:

Title:M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Paper and Code