Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Sep 03, 2023

Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang

Share this with someone who'll enjoy it:

Abstract:Large-scale text-to-image diffusion models have shown impressive capabilities across various generative tasks, enabled by strong vision-language alignment obtained through pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.

View paper on

Share this with someone who'll enjoy it:

Title:VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Paper and Code