Haoji Hu

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Feb 02, 2026

Learn Before Represent: Bridging Generative and Contrastive Learning for Domain-Specific LLM Embeddings

Jan 16, 2026

SemanticGen: Video Generation in Semantic Space

Dec 24, 2025

Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

Nov 17, 2025

No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection

Aug 24, 2025

Orientation Matters: Making 3D Generative Models Orientation-Aligned

Jun 10, 2025

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Mar 14, 2025

Dynamic Token Reduction during Generation for Vision Language Models

Jan 24, 2025

ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

Dec 28, 2024

Enhancing Facial Consistency in Conditional Video Generation via Facial Landmark Transformation

Dec 12, 2024