Jianhua Han

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Mar 09, 2025

Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

Mar 08, 2025

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

Feb 21, 2025

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Dec 09, 2024

AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

Nov 18, 2024

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Nov 14, 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Sep 26, 2024

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Sep 06, 2024

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

Aug 23, 2024

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Jul 11, 2024