Zhaokai Wang

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Dec 12, 2024

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Dec 12, 2024

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Oct 21, 2024

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Oct 10, 2024

Parameter-Inverted Image Pyramid Networks

Jun 06, 2024

Synergizing Spatial Optimization with Large Language Models for Open-Domain Urban Itinerary Planning

Feb 11, 2024

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Dec 14, 2023

Video Background Music Generation: Dataset, Method and Evaluation

Nov 21, 2022

Video Background Music Generation with Controllable Music Transformer

Nov 16, 2021

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Dec 08, 2020