Title: Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Apr 22, 2025

Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Abstract: Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

* Homepage: https://vlm-reg.github.io/
