Xinlong Chen

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Feb 04, 2026, via arXiv

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Feb 02, 2026, via arXiv

ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Jan 30, 2026, via arXiv

DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

Jan 27, 2026, via arXiv

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Dec 17, 2025, via arXiv

VABench: A Comprehensive Benchmark for Audio-Video Generation

Dec 10, 2025, via arXiv

The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

Dec 09, 2025, via arXiv

VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

Jun 10, 2025, via arXiv

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

May 27, 2025, via arXiv

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Apr 14, 2025, via arXiv