Title: Text-to-Image Generation Via Energy-Based CLIP

Aug 30, 2024

Roy Ganz, Michael Elad

Figures 1-4 for Text-to-Image Generation Via Energy-Based CLIP

Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach that extends JEMs to the multimodal vision-language domain using CLIP and integrates both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
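
To make the idea of a cosine-similarity energy concrete, here is a minimal sketch. The abstract does not give the exact functional form, so defining the energy as the negative cosine similarity (with no temperature or scaling) is an assumption, as are the function names `cosine_similarity` and `joint_energy`. The two-dimensional vectors are toy stand-ins for CLIP image and text embeddings, not outputs of a real encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def joint_energy(image_emb, text_emb):
    # Assumption: energy is the negative cosine similarity, so a
    # well-aligned image-caption pair receives LOW energy.
    return -cosine_similarity(image_emb, text_emb)


# Toy embeddings (hypothetical values, not real CLIP features).
matched = joint_energy([1.0, 0.0], [2.0, 0.0])      # same direction
mismatched = joint_energy([1.0, 0.0], [0.0, 3.0])   # orthogonal

print(matched < mismatched)  # the aligned pair has lower energy
```

Under this assumed form, lowering the energy of real image-caption pairs amounts to pulling their embeddings toward the same direction in the shared space, which is the behavior the abstract describes.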
