Title: Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Aug 29, 2024

Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. In contrast, recent attempts to employ image-text models like CLIP as a reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which not only significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
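
The abstract does not state the discriminator's training objective, so the following is only an illustrative sketch of one way a contrastive image-text model could incorporate captions produced by a frozen captioner as additional negatives. It assumes an InfoNCE-style image-to-text loss; the function name, the tensor shapes, and the temperature value are hypothetical and are not taken from the paper.

```python
import numpy as np


def contrastive_loss_with_generated_negatives(img, pos, neg, temperature=0.07):
    """InfoNCE-style image-to-text loss with extra per-image negatives.

    Assumed (hypothetical) inputs:
      img: (B, D) image embeddings
      pos: (B, D) embeddings of the reference captions
      neg: (B, K, D) embeddings of K captions per image, e.g. sampled
           from a frozen captioner and treated as negatives
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    pos = pos / np.linalg.norm(pos, axis=-1, keepdims=True)
    neg = neg / np.linalg.norm(neg, axis=-1, keepdims=True)

    # In-batch similarities: the diagonal holds the matching pairs.
    in_batch = img @ pos.T                          # (B, B)
    # Similarity of each image to its own generated negatives.
    generated = np.einsum("bd,bkd->bk", img, neg)   # (B, K)

    logits = np.concatenate([in_batch, generated], axis=1) / temperature

    # Numerically stable log-softmax over all candidates per image.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The target for image b is its own reference caption (column b).
    idx = np.arange(img.shape[0])
    return float(-log_probs[idx, idx].mean())
```

Under these assumptions, generated captions that score close to the reference caption raise the loss, which pushes the discriminator to separate them from captions consistent with the image. How the resulting score is then used as a reward during captioner fine-tuning is not specified in the abstract and is left out of this sketch.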

* ICPR 2024
