Jifeng Dai

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Jan 14, 2025
Via arXiv

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Dec 20, 2024
Via arXiv

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Dec 12, 2024
Via arXiv

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Dec 12, 2024
Via arXiv

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Dec 12, 2024
Via arXiv

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Dec 06, 2024
Via arXiv

HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Dec 03, 2024
Via arXiv

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Dec 02, 2024
Via arXiv

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Nov 15, 2024
Via arXiv

DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

Oct 22, 2024
Via arXiv