Jianhua Han

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Dec 09, 2024

AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning
Nov 18, 2024

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation
Nov 14, 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Sep 26, 2024

UNIT: Unifying Image and Text Recognition in One Vision Encoder
Sep 06, 2024

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation
Aug 23, 2024

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Jul 11, 2024

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
Jul 09, 2024

DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Apr 14, 2024

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
Mar 18, 2024