Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Controllable Dense Captioner with Multimodal Embedding Bridging

Feb 01, 2024

Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan, Qixiang Ye

Share this with someone who'll enjoy it:

Abstract:In this paper, we propose a controllable dense captioner (ControlCap), which accommodates user's intention to dense captioning by introducing linguistic guidance. ControlCap is defined as a multimodal embedding bridging architecture, which comprises multimodal embedding generation (MEG) module and bi-directional embedding bridging (BEB) module. While MEG module represents objects/regions by combining embeddings of detailed information with context-aware ones, it also endows ControlCap the adaptability to specialized controls by utilizing them as linguistic guidance. BEB module aligns the linguistic guidance with visual embeddings through borrowing/returning features from/to the visual domain and gathering such features to predict text descriptions. Experiments on Visual Genome and VG-COCO datasets show that ControlCap respectively outperforms the state-of-the-art methods by 1.5% and 3.7% (mAP). Last but not least, with the capability of converting region-category pairs to region-text pairs, ControlCap is able to act as a powerful data engine for dense captioning. Code is available at https://github.com/callsys/ControlCap.

* https://github.com/callsys/ControlCap

View paper on

Share this with someone who'll enjoy it:

Title:Controllable Dense Captioner with Multimodal Embedding Bridging

Paper and Code