Juan Carlos Niebles

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Mar 04, 2025
Via arXiv

Unifying Specialized Visual Encoders for Video Language Models

Jan 02, 2025
Via arXiv

ViUniT: Visual Unit Tests for More Robust Visual Programming

Dec 12, 2024
Via arXiv

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

Dec 10, 2024
Via arXiv

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Dec 09, 2024
Via arXiv

Streaming Detection of Queried Event Start

Dec 04, 2024
Via arXiv

SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Nov 20, 2024
Via arXiv

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Nov 18, 2024
Via arXiv

PRACT: Optimizing Principled Reasoning and Acting of LLM Agent

Oct 24, 2024
Via arXiv

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Oct 21, 2024
Via arXiv