Yixiao Ge

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Dec 05, 2024
Via arXiv

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Dec 05, 2024
Via arXiv

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Dec 05, 2024
Via arXiv

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Dec 05, 2024
Via arXiv

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Dec 03, 2024
Via arXiv

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Nov 30, 2024
Via arXiv

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Nov 05, 2024
Via arXiv

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Sep 06, 2024
Via arXiv

SEED-Story: Multimodal Long Story Generation with Large Language Model

Jul 11, 2024
Via arXiv

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Jun 18, 2024
Via arXiv