Kunyu Shi

Enhancing Vision-Language Pre-training with Rich Supervisions

Mar 05, 2024

Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of an image-to-text model on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.

* Accepted to CVPR 2024

Non-autoregressive Sequence-to-Sequence Vision-Language Models

Mar 04, 2024

Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

Abstract: Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to the conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant-time joint inference.

* Accepted to CVPR 2024

Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts

May 11, 2023

Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto

Abstract: We present a sequence-to-sequence vision-language model whose parameters are jointly trained on all tasks (all for one) and fully shared among multiple tasks (one for all), resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
