Boyang Zheng

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

May 15, 2025

Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie

Abstract: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

May 24, 2024

Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong

Abstract: The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which changes the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no works have demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and a deeper understanding of their mechanisms.

Understanding and Improving Adversarial Attacks on Latent Diffusion Model

Oct 07, 2023

Boyang Zheng, Chumeng Liang, Xiaoyu Wu, Yan Liu

Abstract: The Latent Diffusion Model (LDM) has emerged as a leading tool in image generation, particularly with its capability in few-shot generation. This capability also presents risks, notably in unauthorized artwork replication and misinformation generation. In response, adversarial attacks have been designed to safeguard personal images from being used as reference data. However, existing adversarial attacks are predominantly empirical, lacking a solid theoretical foundation. In this paper, we introduce a comprehensive theoretical framework for understanding adversarial attacks on LDM. Based on the framework, we propose a novel adversarial attack that exploits a unified target to guide the adversarial attack in both the forward and the reverse process of LDM. We provide empirical evidence that our method overcomes the offset problem in the optimization of adversarial attacks in existing methods. Through rigorous experiments, we demonstrate that our method outperforms current attacks and is able to generalize over different state-of-the-art few-shot generation pipelines based on LDM. Our method can serve as a stronger and more efficient tool for people exposed to data privacy and security risks to protect themselves in the new era of powerful generative models. The code is available on GitHub: https://github.com/CaradryanLiang/ImprovedAdvDM.git.
