Jianwei Yang

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Jan 09, 2025

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Dec 17, 2024

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Dec 13, 2024

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Dec 12, 2024

Mojito: Motion Trajectory and Intensity Control for Video Generation

Dec 12, 2024

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Dec 05, 2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Oct 15, 2024

Latent Action Pretraining from Videos

Oct 15, 2024

Towards Flexible Visual Relationship Segmentation

Aug 15, 2024

OmniParser for Pure Vision Based GUI Agent

Aug 01, 2024