Linli Yao

Generative Frame Sampler for Long Video Understanding

Mar 12, 2025
Via arXiv

Temporal Reasoning Transfer from Text to Video

Oct 08, 2024
Via arXiv

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Jun 24, 2024
Via arXiv

DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

May 31, 2024
Via arXiv

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Apr 16, 2024
Via arXiv

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Dec 04, 2023
Via arXiv

Edit As You Wish: Video Description Editing with Multi-grained Commands

May 15, 2023
Via arXiv

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Apr 21, 2023
Via arXiv

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Nov 17, 2022
Via arXiv

Image Difference Captioning with Pre-training and Contrastive Learning

Feb 09, 2022
Via arXiv