Title: Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jun 30, 2024

Jiawei Wang, Liping Yuan, Yuchen Zhang


Abstract: Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.
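
The data flow the abstract describes (each frame encoded on its own, with temporal modeling left to the LLM) can be illustrated with a toy sketch. This is not the Tarsier implementation: `encode_frame` is a hypothetical stand-in for CLIP-ViT that turns each pixel row into a two-number "token", and `build_llm_input` is an assumed helper that only assembles the sequence an LLM would consume. The point is that frames never see each other at the encoder, and temporal order survives only as position in the token sequence.

```python
from typing import List

Frame = List[List[float]]  # toy "image": rows of pixel intensities
Token = List[float]        # toy embedding vector


def encode_frame(frame: Frame) -> List[Token]:
    """Hypothetical stand-in for a per-frame image encoder.

    Emits one token per pixel row: [row mean, row max]. A real encoder
    such as CLIP-ViT would emit one embedding per image patch instead.
    """
    return [[sum(row) / len(row), max(row)] for row in frame]


def build_visual_sequence(frames: List[Frame]) -> List[Token]:
    """Encode every frame independently and concatenate in temporal order."""
    tokens: List[Token] = []
    for frame in frames:  # no information is shared across frames here
        tokens.extend(encode_frame(frame))
    return tokens


def build_llm_input(frames: List[Frame], prompt_tokens: List[Token]) -> List[Token]:
    """Assumed layout: visual tokens first, then the text prompt embeddings.

    The LLM's attention over this single sequence is what would relate
    tokens from different frames, i.e. model temporal relationships.
    """
    return build_visual_sequence(frames) + prompt_tokens


if __name__ == "__main__":
    video = [
        [[0.0, 1.0], [2.0, 3.0]],  # frame 0
        [[4.0, 5.0], [6.0, 7.0]],  # frame 1
    ]
    prompt = [[9.0, 9.0]]  # placeholder for an embedded text instruction
    for token in build_llm_input(video, prompt):
        print(token)
```

The visual-then-text ordering and the token shapes are illustrative choices made for this sketch; the abstract does not specify them.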
