Juan Carlos Niebles

Artificial Intelligence Index Report 2025

Apr 08, 2025

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Apr 08, 2025

Re-thinking Temporal Search for Long-Form Video Understanding

Apr 03, 2025

ActionStudio: A Lightweight Framework for Data and Training of Large Action Models

Mar 31, 2025

SocialGen: Modeling Multi-Human Social Interaction with Language Models

Mar 28, 2025

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Mar 04, 2025

Unifying Specialized Visual Encoders for Video Language Models

Jan 02, 2025

ViUniT: Visual Unit Tests for More Robust Visual Programming

Dec 12, 2024

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

Dec 10, 2024

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Dec 09, 2024