
Bingyi Kang

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Jan 21, 2025

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Jan 16, 2025

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Dec 18, 2024

Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Dec 18, 2024

Image Understanding Makes for A Good Tokenizer for Image Generation

Nov 07, 2024

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Nov 04, 2024

How Far is Video Generation from World Model: A Physical Law Perspective

Nov 04, 2024

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Oct 03, 2024

Depth Anything V2

Jun 13, 2024

Improving Token-Based World Models with Parallel Observation Prediction

Feb 13, 2024