Title: What If We Recaption Billions of Web Images with LLaMA-3?

Jun 12, 2024

Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng(+2 more)

Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

* denotes equal contributions
