Anastasija Ilic

Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Mar 12, 2024

Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino

Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that the VLM trained with synthetic data exhibits comparable performance on image captioning, while requiring a fraction of the data used by models trained solely on human-annotated data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, all with improved data efficiency and resource utilization.

* 9 pages, 6 figures
