Kaipeng Zhang

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Mar 10, 2025

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
Mar 09, 2025

ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
Mar 09, 2025

Enhance-A-Video: Better Generated Video for Free
Feb 11, 2025

SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement
Feb 10, 2025

ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
Dec 05, 2024

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Dec 01, 2024

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Oct 23, 2024

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
Oct 11, 2024

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Oct 11, 2024