
Fengyun Rao

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Mar 13, 2025

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Mar 09, 2025

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Mar 04, 2025

Number it: Temporal Grounding Videos like Flipping Manga

Nov 15, 2024

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Oct 15, 2024

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Aug 21, 2024

Visual Perception by Large Language Model's Weights

May 30, 2024

Multi-Modal Generative Embedding Model

May 29, 2024

ReGenNet: Towards Human Action-Reaction Synthesis

Mar 18, 2024

Spatial-Semantic Collaborative Cropping for User Generated Content

Jan 16, 2024