Title: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Nov 23, 2021

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

Abstract: In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence, which can be integrated with text tokens. We formulate all VL problems as a generation task, where the target sequence consists of the integrated text and box tokens. We then train a transformer encoder-decoder to predict the target in an auto-regressive manner. With such a unified framework and input-output format, UNICORN achieves performance comparable to the task-specific state of the art on 7 VL benchmarks, covering the visual grounding, grounded captioning, visual question answering, and image captioning tasks. When trained with multi-task finetuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing the downstream task boundary. We show that having a single model not only saves parameters, but also further boosts model performance on certain tasks. Finally, UNICORN shows the capability of generalizing to new tasks such as ImageNet object localization.
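
The box-to-token step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The bin count (NUM_BINS = 1000), the "<box_i>" token spelling, the corner order (x1, y1, x2, y2), and the helper names quantize_box, dequantize_box, and build_target are all assumptions made for illustration. The abstract only states that each box becomes four discrete tokens serialized alongside text tokens.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumption: bins per coordinate; the abstract gives no value


def quantize_box(box: Sequence[float], width: float, height: float,
                 num_bins: int = NUM_BINS) -> List[str]:
    """Map a pixel-space box (x1, y1, x2, y2) to four discrete box tokens."""
    x1, y1, x2, y2 = box
    tokens = []
    for value, extent in ((x1, width), (y1, height), (x2, width), (y2, height)):
        # Normalize to [0, 1], then clamp so out-of-image values stay in range.
        frac = min(max(value / extent, 0.0), 1.0)
        # The top edge (frac == 1.0) falls into the last bin.
        index = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<box_{index}>")  # hypothetical token spelling
    return tokens


def dequantize_box(tokens: Sequence[str], width: float, height: float,
                   num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Recover approximate pixel coordinates from four box tokens (bin centers)."""
    indices = [int(tok[len("<box_"):-1]) for tok in tokens]
    extents = (width, height, width, height)
    return tuple((i + 0.5) / num_bins * e for i, e in zip(indices, extents))


def build_target(text_tokens: List[str], box: Sequence[float],
                 width: float, height: float) -> List[str]:
    """Serialize text tokens followed by the box tokens into one target sequence."""
    return list(text_tokens) + quantize_box(box, width, height)


if __name__ == "__main__":
    target = build_target(["a", "dog"], (100, 50, 300, 200), 640, 480)
    print(target)
    print(dequantize_box(target[-4:], 640, 480))
```

Under these assumptions the round trip is lossy by at most one bin width per coordinate (here 0.64 px horizontally and 0.48 px vertically for a 640x480 image), which is the price of treating coordinates as vocabulary entries that a decoder can predict auto-regressively.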
