Feihong He

FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes

Feb 28, 2024

Ziying Pan, Kun Wang, Gang Li, Feihong He, Xiwang Li, Yongxuan Lai

Abstract: Class-conditional image generation based on diffusion models is renowned for generating high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., the 1,000 classes in ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains a frontier to explore. In this work, we present a parameter-efficient strategy, called FineDiffusion, to fine-tune large pre-trained diffusion models for large-scale fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by fine-tuning only the tiered class embedder, bias terms, and normalization layers' parameters. To further improve the image generation quality of fine-grained categories, we propose a novel sampling method for fine-grained image generation, which utilizes superclass-conditioned guidance, specifically tailored for fine-grained categories, to replace the conventional classifier-free guidance sampling. Compared to full fine-tuning, FineDiffusion achieves a remarkable 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving a state-of-the-art FID of 9.776 on image generation of 10,000 classes. Extensive qualitative and quantitative experiments demonstrate the superiority of our method compared to other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: https://finediffusion.github.io/.
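The two ideas in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The parameter names, the name fragments in `TRAINABLE_HINTS`, and the helper names are all hypothetical. The guidance formula is an assumption: it takes the usual classifier-free guidance combination and swaps the unconditional prediction for a superclass-conditioned one, which is one plausible reading of "superclass-conditioned guidance".

```python
from typing import Dict, List, Sequence

# Hypothetical name fragments marking the parameters that stay trainable
# (bias terms, normalization layers, class embedder). Real checkpoints
# may name these parameters differently.
TRAINABLE_HINTS = ("bias", "norm", "class_embedder")


def select_trainable(param_sizes: Dict[str, int]) -> List[str]:
    """Return the names of parameters that would be fine-tuned."""
    return [name for name in param_sizes
            if any(hint in name for hint in TRAINABLE_HINTS)]


def trainable_fraction(param_sizes: Dict[str, int]) -> float:
    """Fraction of all scalar parameters that would be fine-tuned and stored."""
    chosen = select_trainable(param_sizes)
    return sum(param_sizes[name] for name in chosen) / sum(param_sizes.values())


def guided_noise(eps_fine: Sequence[float],
                 eps_super: Sequence[float],
                 scale: float) -> List[float]:
    """Combine two noise predictions, guidance style.

    Assumed form: eps_super + scale * (eps_fine - eps_super), i.e. the
    classifier-free guidance formula with the unconditional prediction
    replaced by the superclass-conditioned one.
    """
    return [s + scale * (f - s) for f, s in zip(eps_fine, eps_super)]
```

With `scale` equal to 1 the combination reduces to the fine-grained prediction alone; larger values push the sample further away from the superclass prediction.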

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

Jan 28, 2024

Feihong He, Gang Li, Mengyuan Zhang, Leilei Yan, Lingyu Si, Fanzhang Li

Abstract: The rapid development of generative diffusion models has significantly advanced the field of style transfer. However, most current style transfer methods based on diffusion models involve a slow iterative optimization process, e.g., model fine-tuning and textual inversion of the style concept. In this paper, we introduce FreeStyle, an innovative style transfer method built upon a pre-trained large diffusion model, requiring no further optimization. In addition, our method enables style transfer through a text description of the desired style alone, eliminating the need for style images. Specifically, we propose a dual-stream encoder and single-stream decoder architecture, replacing the conventional U-Net in diffusion models. In the dual-stream encoder, two distinct branches take the content image and style text prompt as inputs, achieving content and style decoupling. In the decoder, we further modulate features from the dual streams based on a given content image and the corresponding style text prompt for precise style transfer. Our experimental results demonstrate high-quality synthesis and fidelity of our method across various content images and style text prompts. The code and more results are available at our project website: https://freestylefreelunch.github.io/.

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Jan 25, 2024

Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal (+2 more)

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences of varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations or misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

* 27 pages, 23 figures

PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification

Oct 05, 2023

Feihong He, Gang Li, Lingyu Si, Leilei Yan, Fanzhang Li, Fuchun Sun

Abstract: Few-shot image classification has received considerable attention for addressing the challenge of poor classification performance with limited samples in novel classes. However, numerous studies have relied on sophisticated learning strategies and diversified feature extraction methods to address this issue. In this paper, we propose PrototypeFormer, a method that aims to significantly advance traditional few-shot image classification approaches by exploring prototype relationships. Specifically, we utilize a transformer architecture to build a prototype extraction module, aiming to extract class representations that are more discriminative for few-shot classification. Additionally, during model training, we propose a contrastive learning-based optimization approach to optimize prototype features in few-shot learning scenarios. Despite its simplicity, the method performs remarkably well, with no bells and whistles. Experiments on several popular few-shot image classification benchmark datasets show that our method outperforms all current state-of-the-art methods. In particular, our method achieves 97.07% and 90.88% on the 5-way 5-shot and 5-way 1-shot tasks of miniImageNet, surpassing the previous state-of-the-art accuracy by 7.27% and 8.72%, respectively. The code will be released later.

* Submitted to AAAI 2024
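For readers new to prototype-based few-shot classification, the following Python sketch shows the baseline idea that the abstract builds on. It is not PrototypeFormer itself: the abstract describes a transformer-based prototype extraction module, whereas this sketch swaps in the plain mean of the support embeddings as the class prototype and assigns a query to the nearest prototype by Euclidean distance. The function names and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_prototype(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the support embeddings of one class, dimension by dimension."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def classify(query: Sequence[float],
             support: Dict[str, Sequence[Sequence[float]]]) -> str:
    """Return the label whose prototype is closest to the query embedding."""
    prototypes = {label: mean_prototype(vectors)
                  for label, vectors in support.items()}
    return min(prototypes,
               key=lambda label: math.dist(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 2-D embeddings.
support_set = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[4.0, 4.0], [6.0, 4.0]],
}
predicted = classify([1.0, 1.0], support_set)
```

In a 5-way 5-shot episode the `support` dictionary would hold five labels with five embeddings each; a learned prototype module would replace `mean_prototype`.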

CartoonDiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

Sep 15, 2023

Feihong He, Gang Li, Lingyu Si, Leilei Yan, Shimeng Hou, Hongwei Dong, Fanzhang Li

Abstract: Image cartoonization has attracted significant interest in the field of image generation. However, most existing image cartoonization techniques require re-training models on cartoon-style images. In this paper, we present CartoonDiff, a novel training-free sampling approach that performs image cartoonization using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into a semantic generation phase and a detail generation phase. Furthermore, we implement the image cartoonization process by normalizing the high-frequency signal of the noisy image in specific denoising steps. CartoonDiff does not require any additional reference images, complex model designs, or the tedious adjustment of multiple parameters. Extensive experimental results show the powerful ability of our CartoonDiff. The project page is available at: https://cartoondiff.github.io/

* 5 pages, 5 figures
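The abstract does not say how the high-frequency signal is isolated or normalized, so the Python sketch below only illustrates the general idea on a 1-D signal under stated assumptions. It assumes a moving-average low-pass filter, treats the residual as the high-frequency part, and standardizes that residual to zero mean and unit standard deviation before recombining. All function names and the `radius` parameter are hypothetical, and none of this is taken from the CartoonDiff code.

```python
import statistics
from typing import List, Sequence, Tuple


def split_frequencies(signal: Sequence[float],
                      radius: int = 1) -> Tuple[List[float], List[float]]:
    """Split a signal into a moving-average (low) part and the residual (high)."""
    length = len(signal)
    low = []
    for i in range(length):
        # Window is truncated at the edges of the signal.
        window = signal[max(0, i - radius): min(length, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [value - smooth for value, smooth in zip(signal, low)]
    return low, high


def normalize(values: Sequence[float]) -> List[float]:
    """Standardize to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant input has no variation to rescale.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


def normalize_high_frequency(signal: Sequence[float],
                             radius: int = 1) -> List[float]:
    """Recombine the low-frequency part with the standardized residual."""
    low, high = split_frequencies(signal, radius)
    return [smooth + detail for smooth, detail in zip(low, normalize(high))]
```

In a sampler, a step like this would presumably be applied only at selected denoising steps, as the abstract indicates; which steps, and on which representation of the image, is not specified here.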
