Kaihang Pan

OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Dec 22, 2025
Via arXiv

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Jun 10, 2025
Via arXiv

FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Jun 05, 2025
Via arXiv

Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

May 18, 2025
Via arXiv

Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

May 12, 2025
Via arXiv

On Path to Multimodal Generalist: General-Level and General-Bench

May 07, 2025
Via arXiv

Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Apr 22, 2025
Via arXiv

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Apr 20, 2025
Via arXiv

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

Dec 13, 2024
Via arXiv

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Nov 29, 2024
Via arXiv