Abstract:Evaluating the reasoning capabilities of Vision-Language Models (VLMs) in complex visual tasks provides valuable insights into their potential and limitations. In this work, we assess the performance of VLMs on the challenging Bongard Openworld Problems benchmark, which involves reasoning over natural images. We propose and evaluate three human-inspired paradigms: holistic analysis (global context processing), deductive rule learning (explicit rule derivation and application), and componential analysis (structured decomposition of images into components). Our results demonstrate that state-of-the-art models, including GPT-4o and Gemini, not only surpass human benchmarks but also excel in structured reasoning tasks, with componential analysis proving especially effective. However, ablation studies reveal key challenges, such as handling synthetic images, making fine-grained distinctions, and interpreting nuanced contextual information. These insights underscore the need for further advancements in model robustness and generalization, while highlighting the transformative potential of structured reasoning approaches in enhancing VLM capabilities.
Abstract:Commonsense reasoning has long been considered as one of the holy grails of artificial intelligence. Most of the recent progress in the field has been achieved by novel machine learning algorithms for natural language processing. However, without incorporating logical reasoning, these algorithms remain arguably shallow. With some notable exceptions, developers of practical automated logic-based reasoners have mostly avoided focusing on the problem. The paper argues that the methods and algorithms used by existing automated reasoners for classical first-order logic can be extended towards commonsense reasoning. Instead of devising new specialized logics we propose a framework of extensions to the mainstream resolution-based search methods to make these capable of performing search tasks for practical commonsense reasoning with reasonable efficiency. The proposed extensions mostly rely on operating on ordinary proof trees and are devised to handle commonsense knowledge bases containing inconsistencies, default rules, taxonomies, topics, relevance, confidence and similarity measures. We claim that machine learning is best suited for the construction of commonsense knowledge bases while the extended logic-based methods would be well-suited for actually answering queries from these knowledge bases.