Zhaokai Wang

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Oct 21, 2024

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Oct 10, 2024

Parameter-Inverted Image Pyramid Networks

Jun 06, 2024

Synergizing Spatial Optimization with Large Language Models for Open-Domain Urban Itinerary Planning

Feb 11, 2024

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Dec 14, 2023

Video Background Music Generation: Dataset, Method and Evaluation

Nov 21, 2022

Video Background Music Generation with Controllable Music Transformer

Nov 16, 2021

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Dec 08, 2020