Peyman M. Kiasari

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Jun 06, 2025

Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Abstract: Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, which is a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/
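
To make the isomorphism task concrete, here is a minimal Python sketch of what "the same graph under a different layout" means. It is not the VGA code or the authors' method: the function name `is_isomorphic`, the example graphs, and the coordinate dictionaries are all illustrative assumptions, and the brute-force search is only practical for very small graphs.

```python
from itertools import permutations


def is_isomorphic(n, edges_a, edges_b):
    """Brute-force isomorphism test for two simple undirected graphs on nodes 0..n-1."""
    target = {frozenset(e) for e in edges_b}
    if len({frozenset(e) for e in edges_a}) != len(target):
        return False
    # Try every relabeling of the nodes of graph A and compare edge sets.
    for perm in permutations(range(n)):
        relabeled = {frozenset((perm[u], perm[v])) for u, v in edges_a}
        if relabeled == target:
            return True
    return False


# Hypothetical example: a 4-cycle drawn as a square ...
square_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
square_pos = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}

# ... and the same cycle with relabeled nodes, drawn with crossing edges.
crossed_edges = [(0, 2), (2, 1), (1, 3), (3, 0)]
crossed_pos = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}

# A triangle with one pendant node: same edge count, different structure.
paw_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

# The coordinates never enter the test: only the edge structure matters.
same = is_isomorphic(4, square_edges, crossed_edges)
different = is_isomorphic(4, square_edges, paw_edges)
```

The position dictionaries are deliberately unused by the test. That is the property the abstract calls representation-invariant reasoning: the answer depends on the edge structure, not on how the graph is drawn.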

The Master Key Filters Hypothesis: Deep Filters Are General in DS-CNNs

Dec 21, 2024

Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Abstract: This paper challenges the prevailing view that convolutional neural network (CNN) filters become increasingly specialized in deeper layers. Motivated by recent observations of clusterable repeating patterns in depthwise separable CNNs (DS-CNNs) trained on ImageNet, we extend this investigation across various domains and datasets. Our analysis of DS-CNNs reveals that deep filters maintain generality, contradicting the expected transition to class-specific filters. We demonstrate the generalizability of these filters through transfer learning experiments, showing that frozen filters from models trained on different datasets perform well and can be further improved when sourced from larger datasets. Our findings indicate that spatial features learned by depthwise separable convolutions remain generic across all layers, domains, and architectures. This research provides new insights into the nature of generalization in neural networks, particularly in DS-CNNs, and has significant implications for transfer learning and model design.

Unveiling the Unseen: Identifiable Clusters in Trained Depthwise Convolutional Kernels

Jan 25, 2024

Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Abstract: Recent advances in depthwise-separable convolutional neural networks (DS-CNNs) have led to novel architectures that surpass the performance of classical CNNs by a considerable margin in scalability and accuracy. This paper reveals another striking property of DS-CNN architectures: discernible and explainable patterns emerge in their trained depthwise convolutional kernels in all layers. Through an extensive analysis of millions of trained filters, with different sizes and from various models, we employed unsupervised clustering with autoencoders to categorize these filters. Astonishingly, the patterns converged into a few main clusters, each resembling the difference of Gaussian (DoG) functions and their first- and second-order derivatives. Notably, we were able to classify over 95% and 90% of the filters from state-of-the-art ConvNeXtV2 and ConvNeXt models, respectively. This finding is not merely a technological curiosity; it echoes the foundational models neuroscientists have long proposed for the vision systems of mammals. Our results thus deepen our understanding of the emergent properties of trained DS-CNNs and provide a bridge between artificial and biological visual processing systems. More broadly, they pave the way for more interpretable and biologically-inspired neural network designs in the future.
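
For readers unfamiliar with the difference of Gaussians (DoG) shape mentioned in this abstract, the following Python sketch builds a small DoG kernel from the standard definition (a narrow Gaussian minus a wider one). The function names, the 7x7 size, and the sigma values are illustrative assumptions; they are not taken from the paper and this is not the authors' clustering pipeline.

```python
import math


def gaussian_2d(x, y, sigma):
    """Normalized isotropic 2D Gaussian evaluated at (x, y)."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def dog_kernel(size, sigma_center, sigma_surround):
    """Difference-of-Gaussians kernel on a size x size grid (size should be odd)."""
    half = size // 2
    return [
        [
            gaussian_2d(x, y, sigma_center) - gaussian_2d(x, y, sigma_surround)
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]


# Assumed example values: a 7x7 kernel with a narrow center and a wider surround.
kernel = dog_kernel(7, sigma_center=1.0, sigma_surround=2.0)

# Print a coarse view: the center is positive, the periphery is negative.
for row in kernel:
    print(" ".join(f"{value:+.3f}" for value in row))
```

With these assumed parameters the kernel is positive at the center and negative toward the corners, which is the center-surround profile associated with DoG models of receptive fields.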

Neural Echos: Depthwise Convolutional Filters Replicate Biological Receptive Fields

Jan 18, 2024

Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Abstract: In this study, we present evidence suggesting that depthwise convolutional kernels are effectively replicating the structural intricacies of the biological receptive fields observed in the mammalian retina. We provide analytics of trained kernels from various state-of-the-art models substantiating this evidence. Inspired by this intriguing discovery, we propose an initialization scheme that draws inspiration from the biological receptive fields. Experimental analysis of the ImageNet dataset with multiple CNN architectures featuring depthwise convolutions reveals a marked enhancement in the accuracy of the learned model when initialized with biologically derived weights. This underscores the potential for biologically inspired computational models to further our understanding of vision processing systems and to improve the efficacy of convolutional networks.

* Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2024) 8216-8225
