Title: BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

Oct 16, 2023

Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li

Abstract: Building models that generate textual responses to user instructions for videos is a practical and challenging topic, as it requires both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem, as existing studies train models on massive sparse videos aligned with brief descriptions. In this paper, we introduce BiLL-VTG, a fast, adaptive framework that leverages large language models (LLMs) to reason over videos based on essential lightweight visual tools. Specifically, we show that the key to responding to specific instructions is to concentrate on the relevant video events, and we use two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent event information. An LLM equipped with world knowledge then serves as the reasoning agent, producing the response by performing multiple reasoning steps on the specified video events. To address the difficulty of having the agent specify events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm, based on efficient Hungarian matching, that localizes the corresponding video events from linguistic instructions, enabling LLMs to interact with long videos. Extensive experiments on two typical video-based text generation tasks show that our tuning-free framework outperforms pre-trained models, including Flamingo-80B, and achieves state-of-the-art performance.
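
The abstract names Hungarian matching as the basis of InsOVER but does not give the algorithm's details. The sketch below illustrates only the general idea of a minimum-cost one-to-one assignment between instruction phrases and per-frame captions. It is not the authors' implementation: the word-overlap cost and the names `token_overlap_cost`, `hungarian`, and `match_phrases_to_frames` are assumptions made for illustration.

```python
from typing import List, Tuple


def token_overlap_cost(a: str, b: str) -> float:
    """Hypothetical cost: 1 - Jaccard similarity of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def hungarian(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns sorted (row, column) pairs, one per row. This is the classic
    O(n^2 * m) potentials formulation of the Hungarian algorithm.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def match_phrases_to_frames(phrases: List[str],
                            captions: List[str]) -> List[Tuple[int, int]]:
    """Assign each instruction phrase to a distinct frame caption."""
    cost = [[token_overlap_cost(ph, c) for c in captions] for ph in phrases]
    return hungarian(cost)


# Toy data, invented for illustration only.
phrases = ["man opens door", "dog runs outside"]
captions = ["a dog runs outside the house",
            "a man opens the door",
            "an empty street"]
pairs = match_phrases_to_frames(phrases, captions)
```

Here `pairs` holds one `(phrase_index, caption_index)` pair per phrase. A real system would presumably replace the word-overlap cost with a learned text or vision-language similarity; that choice is outside what the abstract specifies.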
