Yutong Bai

Whole-Body Conditioned Egocentric Video Prediction

Jun 26, 2025

TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy

Jun 12, 2025

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

May 30, 2025

REOrdering Patches Improves Vision Models

May 29, 2025

"I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts

Apr 21, 2025

Vector Quantized Feature Fields for Fast 3D Semantic Lifting

Mar 09, 2025

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Dec 03, 2024

Analyzing The Language of Visual Tokens

Nov 07, 2024

Evaluating Multiview Object Consistency in Humans and Image Models

Sep 10, 2024

KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

Jul 25, 2024