Picture for Yuhang Zang

Yuhang Zang

Unified Reward Model for Multimodal Understanding and Generation

Add code
Mar 07, 2025
Viaarxiv icon

Visual-RFT: Visual Reinforcement Fine-Tuning

Add code
Mar 03, 2025
Viaarxiv icon

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Add code
Feb 18, 2025
Viaarxiv icon

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Add code
Feb 12, 2025
Viaarxiv icon

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Add code
Feb 07, 2025
Viaarxiv icon

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Add code
Jan 21, 2025
Figure 1 for InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Figure 2 for InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Figure 3 for InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Figure 4 for InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Viaarxiv icon

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Add code
Jan 09, 2025
Viaarxiv icon

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Add code
Jan 06, 2025
Figure 1 for BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Figure 2 for BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Figure 3 for BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Figure 4 for BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Viaarxiv icon

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Add code
Jan 06, 2025
Figure 1 for Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Figure 2 for Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Figure 3 for Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Figure 4 for Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Viaarxiv icon

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Add code
Dec 12, 2024
Figure 1 for InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Figure 2 for InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Figure 3 for InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Figure 4 for InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Viaarxiv icon