Lishuai Gao

LinVT: Empower Your Image-level Large Language Model to Understand Videos

Dec 06, 2024

Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao

Abstract: Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.

Video Temporal Relationship Mining for Data-Efficient Person Re-identification

Oct 01, 2021

Siyu Chen, Dengjie Li, Lishuai Gao, Fan Liang, Wei Zhang, Lin Ma

Abstract: This paper is a technical report on our submission to the ICCV 2021 VIPriors Re-identification Challenge. To make full use of the visual inductive priors of the data, we treat the query and gallery images of the same identity as continuous frames in a video sequence. We propose a novel post-processing strategy for video temporal relationship mining, which calculates not only the distance matrix between query and gallery images but also the distance matrix between gallery images. The initial query image is used to retrieve the most similar image from the gallery; the retrieved image is then treated as a new query to retrieve its most similar image from the gallery. By iteratively searching for the closest image, we achieve accurate image retrieval and obtain a robust retrieval sequence.
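
The iterative retrieval described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`distance_matrix`, `chain_retrieval`), the Euclidean distance, the toy one-dimensional features, and the rule that an already retrieved gallery image is not retrieved again are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows, cols):
    """Pairwise distances: one row per vector in `rows`, one column per vector in `cols`."""
    return [[euclidean(r, c) for c in cols] for r in rows]


def chain_retrieval(query_to_gallery, gallery_to_gallery):
    """Build a retrieval sequence for a single query.

    query_to_gallery: distances from the query to each gallery image.
    gallery_to_gallery: square matrix of distances between gallery images.
    The closest gallery image is retrieved first; the retrieved image then
    acts as the new query. Assumption: retrieved images are excluded from
    later steps so the sequence visits each gallery image once.
    """
    remaining = set(range(len(query_to_gallery)))
    current = query_to_gallery
    sequence = []
    while remaining:
        # Ties are broken by the lower gallery index.
        nearest = min(remaining, key=lambda j: (current[j], j))
        sequence.append(nearest)
        remaining.remove(nearest)
        current = gallery_to_gallery[nearest]
    return sequence


# Toy data (hypothetical): 1-D features for one query and four gallery images.
query = [0.0]
gallery = [[2.0], [1.0], [-1.5], [4.0]]

q2g = distance_matrix([query], gallery)[0]
g2g = distance_matrix(gallery, gallery)

direct_ranking = sorted(range(len(gallery)), key=lambda j: (q2g[j], j))
chained_ranking = chain_retrieval(q2g, g2g)

print("direct :", direct_ranking)
print("chained:", chained_ranking)
```

In this toy case the chained sequence differs from a plain ranking by query distance, because each step follows the nearest neighbour of the previously retrieved image rather than of the original query.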
