Zhaoyang Liu

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks

Mar 14, 2025
Via arXiv

AudioX: Diffusion Transformer for Anything-to-Audio Generation

Mar 13, 2025
Via arXiv

ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Dec 25, 2024
Via arXiv

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Dec 06, 2024
Via arXiv

ONION: Physics-Informed Deep Learning Model for Line Integral Diagnostics Across Fusion Devices

Nov 27, 2024
Via arXiv

What is Wrong with Perplexity for Long-context Language Modeling?

Oct 31, 2024
Via arXiv

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Jul 30, 2024
Via arXiv

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Jun 12, 2024
Via arXiv

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Jun 06, 2024
Via arXiv

LLMs Meet Multimodal Generation and Editing: A Survey

May 29, 2024
Via arXiv