Jiashi Feng

NUS

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Jan 16, 2025

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Jan 07, 2025

Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

Dec 24, 2024

Parallelized Autoregressive Visual Generation

Dec 19, 2024

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Dec 18, 2024

Image Understanding Makes for A Good Tokenizer for Image Generation

Nov 07, 2024

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Nov 04, 2024

How Far is Video Generation from World Model: A Physical Law Perspective

Nov 04, 2024

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Oct 14, 2024

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Oct 03, 2024