Hanrong Ye

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Jan 14, 2026
Via arXiv

Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations

Aug 25, 2025
Via arXiv

Scaling RL to Long Videos

Jul 10, 2025
Via arXiv

Multi-Task Label Discovery via Hierarchical Task Tokens for Partially Annotated Dense Predictions

Nov 27, 2024
Via arXiv

MM-Ego: Towards Building Egocentric Multimodal LLMs

Oct 09, 2024
Via arXiv

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Jul 01, 2024
Via arXiv

X-VILA: Cross-Modality Alignment for Large Language Model

May 29, 2024
Via arXiv

DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data

Mar 22, 2024
Via arXiv

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

Nov 06, 2023
Via arXiv

TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts

Jul 28, 2023
Via arXiv