Georgia Institute of Technology
Abstract: Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at trivial spatial cognition, e.g., recognizing whether one object is to the left of another in an uncluttered scene. We developed a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance can be degraded by minor variations in prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs reason about spatial relations in real-world applications. They also reveal novel opportunities for enriching image-caption corpora for more efficient training and testing.
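As an illustration of the kind of consistency check this motivates, the sketch below probes a VLM with logically equivalent phrasings of one spatial question and scores how often the answers agree. It is a minimal sketch, not the TableTest protocol itself: `query_vlm`, the example prompts, and the yes/no answer format are hypothetical stand-ins for whatever model interface and benchmark items are actually used.

```python
# Sketch of a prompt-equivalence consistency check in the spirit of TableTest.
# `query_vlm` is a placeholder for the VLM API of your choice (assumption);
# everything below it is generic Python.

from typing import Callable

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send the image and prompt to a VLM, return its answer."""
    raise NotImplementedError("Wire this up to a concrete VLM interface.")

# Logically equivalent phrasings of the same spatial question (illustrative).
EQUIVALENT_PROMPTS = [
    "Is the cup to the left of the plate? Answer yes or no.",
    "Is the plate to the right of the cup? Answer yes or no.",
    "Does the cup appear on the left side of the plate? Answer yes or no.",
]

def consistency_rate(image_path: str,
                     ask: Callable[[str, str], str] = query_vlm) -> float:
    """Fraction of prompt variants that agree with the majority answer."""
    answers = [ask(image_path, p).strip().lower() for p in EQUIVALENT_PROMPTS]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers)
```

A reliable model should score 1.0 on such a set; the finding reported above is precisely that rewording alone can break this agreement.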
Abstract: Neuromorphic (event-based) image sensors draw inspiration from the human retina to create electronic devices that process visual stimuli in a way that closely resembles their biological counterpart. These sensors process information significantly differently from traditional RGB sensors. Specifically, the sensory information generated by event-based image sensors is orders of magnitude sparser than that of RGB sensors. The first generation of neuromorphic image sensors, the Dynamic Vision Sensor (DVS), is inspired by the computations confined to the photoreceptors and the first retinal synapse. In this work, we highlight the capability of the second generation of neuromorphic image sensors, Integrated Retinal Functionality in CMOS Image Sensors (IRIS), which aims to mimic the full retinal computation, from photoreceptors to the retinal output (retinal ganglion cells), for targeted feature extraction. The feature of choice in this work is Object Motion Sensitivity (OMS), which is processed locally in the IRIS sensor. We study the capability of OMS in solving the ego-motion problem of event-based cameras. Our results show that OMS can accomplish standard computer vision tasks with efficiency similar to conventional RGB and DVS solutions while offering a drastic bandwidth reduction. This cuts wireless and computing power budgets and opens up vast opportunities for high-speed, robust, energy-efficient, and low-bandwidth real-time decision making.
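To make the OMS idea concrete, here is a toy, frame-based sketch of the computation, not the in-pixel IRIS circuitry itself: an OMS cell responds where event activity in its receptive-field center departs from the wide-field surround activity that ego-motion produces. The window sizes and threshold below are illustrative assumptions.

```python
# Toy, frame-based approximation of Object Motion Sensitivity (OMS).
# Assumptions: events are binned into a 2D per-pixel count frame; center and
# surround are square averaging windows; the threshold is arbitrary.

import numpy as np
from scipy.ndimage import uniform_filter

def oms_frame(event_frame: np.ndarray,
              center: int = 3,
              surround: int = 31,
              threshold: float = 0.2) -> np.ndarray:
    """Return a binary map of candidate object (non-ego) motion.

    event_frame: 2D array of per-pixel event counts in one time window.
    """
    frame = event_frame.astype(float)
    center_act = uniform_filter(frame, size=center)      # receptive-field center
    surround_act = uniform_filter(frame, size=surround)  # wide-field / ego motion
    # Respond only where the center is more active than the surround predicts.
    return (center_act - surround_act) > threshold
```

Under pure ego-motion, center and surround activity rise together, so the difference stays below threshold and the output is silent; an independently moving object drives the center without a matching surround, yielding the sparse output behind the bandwidth reduction described above.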