Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Oct 10, 2023

Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

Figure 1 for TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Figure 2 for TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Figure 3 for TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Figure 4 for TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Share this with someone who'll enjoy it:

Abstract:Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels enabling the segment merger to learn the segment similarity as well as guiding the label generator to learn object semantics and relation predicates, resulting in a fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://vis-www.cs.umass.edu/TextPSG.

* Accepted by ICCV 2023

View paper on

Share this with someone who'll enjoy it:

Title:TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Paper and Code