Picture for Zhongang Qi

Zhongang Qi

Mark

DOGE: Towards Versatile Visual Document Grounding and Referring

Add code
Nov 26, 2024
Viaarxiv icon

mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Add code
Nov 22, 2024
Figure 1 for mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Figure 2 for mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Figure 3 for mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Figure 4 for mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Viaarxiv icon

Taming Rectified Flow for Inversion and Editing

Add code
Nov 07, 2024
Figure 1 for Taming Rectified Flow for Inversion and Editing
Figure 2 for Taming Rectified Flow for Inversion and Editing
Figure 3 for Taming Rectified Flow for Inversion and Editing
Figure 4 for Taming Rectified Flow for Inversion and Editing
Viaarxiv icon

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Add code
Sep 26, 2024
Viaarxiv icon

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Add code
Aug 23, 2024
Figure 1 for CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Figure 2 for CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Figure 3 for CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Figure 4 for CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Viaarxiv icon

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Add code
Aug 07, 2024
Figure 1 for SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
Figure 2 for SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
Figure 3 for SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
Figure 4 for SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
Viaarxiv icon

EA-VTR: Event-Aware Video-Text Retrieval

Add code
Jul 10, 2024
Viaarxiv icon

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Add code
Jul 10, 2024
Viaarxiv icon

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Add code
Jun 05, 2024
Viaarxiv icon

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

Add code
Mar 15, 2024
Viaarxiv icon