Xiaojian Ma

LongViTU: Instruction Tuning for Long-Form Video Understanding
Jan 09, 2025

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Dec 31, 2024

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Dec 20, 2024

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Oct 23, 2024

Multi-modal Situated Reasoning in 3D Scenes
Sep 04, 2024

Task-oriented Sequential Grounding in 3D Scenes
Aug 07, 2024

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Jul 07, 2024

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Jun 27, 2024

Latent Energy-Based Odyssey: Black-Box Optimization via Expanded Exploration in the Energy-Based Latent Space
May 27, 2024

Unifying 3D Vision-Language Understanding via Promptable Queries
May 19, 2024