Picture for Jianwei Yang

Jianwei Yang

Magma: A Foundation Model for Multimodal AI Agents

Add code
Feb 18, 2025
Viaarxiv icon

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Add code
Jan 09, 2025
Figure 1 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 2 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 3 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 4 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Viaarxiv icon

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Add code
Dec 17, 2024
Viaarxiv icon

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Add code
Dec 13, 2024
Viaarxiv icon

Mojito: Motion Trajectory and Intensity Control for Video Generation

Add code
Dec 12, 2024
Figure 1 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 2 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 3 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 4 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Viaarxiv icon

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Add code
Dec 12, 2024
Viaarxiv icon

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Add code
Dec 05, 2024
Figure 1 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 2 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 3 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 4 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Viaarxiv icon

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Add code
Oct 15, 2024
Figure 1 for TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Figure 2 for TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Figure 3 for TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Figure 4 for TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Viaarxiv icon

Latent Action Pretraining from Videos

Add code
Oct 15, 2024
Figure 1 for Latent Action Pretraining from Videos
Figure 2 for Latent Action Pretraining from Videos
Figure 3 for Latent Action Pretraining from Videos
Figure 4 for Latent Action Pretraining from Videos
Viaarxiv icon

Towards Flexible Visual Relationship Segmentation

Add code
Aug 15, 2024
Viaarxiv icon