Zhongang Qi

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Apr 10, 2025

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

Mar 28, 2025

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Mar 27, 2025

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Dec 27, 2024

DOGE: Towards Versatile Visual Document Grounding and Referring

Nov 26, 2024

mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Nov 22, 2024

Taming Rectified Flow for Inversion and Editing

Nov 07, 2024

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Sep 26, 2024

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Aug 23, 2024

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Aug 07, 2024