
Jinguo Zhu

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Dec 12, 2024
Via arXiv

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Dec 12, 2024
Via arXiv

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Dec 06, 2024
Via arXiv

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Nov 15, 2024
Via arXiv

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

Oct 21, 2024
Via arXiv

Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection

Jul 27, 2024
Via arXiv

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Jun 11, 2024
Via arXiv

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Apr 22, 2024
Via arXiv

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Dec 14, 2023
Via arXiv

VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Oct 07, 2023
Via arXiv