
Yuqi Huo

Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

Oct 21, 2024

Exploring the Design Space of Visual Context Representation in Video MLLMs

Oct 17, 2024

Baichuan-Omni Technical Report

Oct 11, 2024

Towards Event-oriented Long Video Understanding

Jun 20, 2024

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

Jun 13, 2024

VDT: An Empirical Study on Video Diffusion with Transformers

May 22, 2023

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Feb 13, 2023

LGDN: Language-Guided Denoising Network for Video-Language Modeling

Oct 03, 2022

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

Apr 15, 2022

WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model

Oct 27, 2021