Picture for Qi Dai

Qi Dai

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Add code
Mar 20, 2025
Viaarxiv icon

HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard

Add code
Mar 18, 2025
Viaarxiv icon

HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

Add code
Mar 14, 2025
Viaarxiv icon

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Add code
Mar 03, 2025
Viaarxiv icon

FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

Add code
Feb 12, 2025
Viaarxiv icon

UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Add code
Dec 14, 2024
Figure 1 for UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Figure 2 for UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Figure 3 for UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Figure 4 for UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Viaarxiv icon

MageBench: Bridging Large Multimodal Models to Agents

Add code
Dec 05, 2024
Viaarxiv icon

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Add code
Nov 26, 2024
Viaarxiv icon

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

Add code
Nov 26, 2024
Figure 1 for LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Figure 2 for LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Figure 3 for LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Figure 4 for LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
Viaarxiv icon

REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

Add code
Nov 20, 2024
Viaarxiv icon