Jifeng Dai

LangBridge: Interpreting Image as a Combination of Language Embeddings

Mar 26, 2025

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Mar 25, 2025

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Mar 13, 2025

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Mar 13, 2025

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

Mar 03, 2025

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Jan 14, 2025

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Dec 20, 2024

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Dec 12, 2024

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Dec 12, 2024

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Dec 12, 2024