Yongdong Zhang

Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Mar 25, 2025
Via arXiv

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Mar 18, 2025
Via arXiv

OmniPrism: Learning Disentangled Visual Concept for Image Generation

Dec 16, 2024
Via arXiv

LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation

Dec 13, 2024
Via arXiv

A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

Dec 12, 2024
Via arXiv

T-SVG: Text-Driven Stereoscopic Video Generation

Dec 12, 2024
Via arXiv

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Nov 23, 2024
Via arXiv

It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment

Nov 16, 2024
Via arXiv

MILP-StuDio: MILP Instance Generation via Block Structure Decomposition

Oct 31, 2024
Via arXiv

Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

Oct 19, 2024
Via arXiv