Linli Yao

Temporal Reasoning Transfer from Text to Video

Oct 08, 2024

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Jun 24, 2024

DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

May 31, 2024

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Apr 16, 2024

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Dec 04, 2023

Edit As You Wish: Video Description Editing with Multi-grained Commands

May 15, 2023

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Apr 21, 2023

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Nov 17, 2022

Image Difference Captioning with Pre-training and Contrastive Learning

Feb 09, 2022

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

Apr 12, 2020