Title: MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

Dec 06, 2023

Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor

Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations: the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses recent advances in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; we therefore propose a multi-objective reward function that jointly targets both qualities without requiring any strong supervision. We show that these goals can be optimized simultaneously with our framework, improving performance for various captioning models of different scales. Our qualitative and quantitative results show MOCHa's superior performance across a range of established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.
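The abstract says the reward jointly targets fidelity and adequacy but does not give its form. The sketch below is a minimal illustration under stated assumptions: it combines the two objectives as a convex combination with a hypothetical weight `alpha`, and it uses hypothetical word-overlap scorers (`toy_fidelity`, `toy_adequacy`) as stand-ins. These are not the scorers MOCHa uses; the abstract only says that reference captions serve as proxies for logical consistency. The sketch shows why fidelity alone is not enough: a terse caption can score perfect fidelity while covering little of the image content.

```python
from typing import Callable, Sequence

# Assumed scorer signature: (candidate caption, reference captions) -> score in [0, 1].
Scorer = Callable[[str, Sequence[str]], float]


def toy_fidelity(caption: str, references: Sequence[str]) -> float:
    """Hypothetical fidelity stand-in: share of caption words supported by some reference.

    Words absent from every reference (likely hallucinations) lower the score.
    """
    words = caption.lower().split()
    if not words:
        return 0.0
    support = {w for ref in references for w in ref.lower().split()}
    return sum(w in support for w in words) / len(words)


def toy_adequacy(caption: str, references: Sequence[str]) -> float:
    """Hypothetical adequacy stand-in: best word recall of any single reference."""
    caption_words = set(caption.lower().split())
    best = 0.0
    for ref in references:
        ref_words = ref.lower().split()
        if ref_words:
            recall = sum(w in caption_words for w in ref_words) / len(ref_words)
            best = max(best, recall)
    return best


def multi_objective_reward(
    caption: str,
    references: Sequence[str],
    fidelity_fn: Scorer = toy_fidelity,
    adequacy_fn: Scorer = toy_adequacy,
    alpha: float = 0.5,  # assumed trade-off weight between the two objectives
) -> float:
    """Convex combination of fidelity and adequacy (an assumed form, not MOCHa's exact reward)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fidelity = fidelity_fn(caption, references)
    adequacy = adequacy_fn(caption, references)
    return alpha * fidelity + (1.0 - alpha) * adequacy


if __name__ == "__main__":
    refs = ["a dog runs on the grass", "a brown dog playing outside"]
    for cap in [
        "a dog",                                  # faithful but uninformative
        "a dog runs on the grass with a frisbee", # informative but hallucinates
        "a brown dog runs on the grass",          # faithful and informative
    ]:
        print(f"{cap!r}: {multi_objective_reward(cap, refs):.3f}")
```

With these toy scorers, the terse caption gets full fidelity but only 0.4 adequacy (reward 0.700). The caption that mentions an unsupported frisbee loses fidelity (reward 0.889). Only the caption that is both faithful and detailed reaches 1.000. In an RL setup, this scalar would serve as the per-caption reward; the real scorers and weighting are whatever the paper specifies.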

* Website Link: https://assafbk.github.io/mocha/
