Stefan Stojanov

Weakly-Supervised Learning of Dense Functional Correspondences

Sep 04, 2025

Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu

Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision-language models.

* Accepted at ICCV 2025. Project website: https://dense-functional-correspondence.github.io/

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

May 26, 2025

Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

Abstract: This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.

Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Mar 25, 2025

Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins

Abstract: Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

* Project webpage: https://neuroailab.github.io/opt_cwm_page/

Leveraging Object Priors for Point Tracking

Sep 09, 2024

Bikram Boote, Anh Thai, Wenqi Jia, Ozgur Kara, Stefan Stojanov, James M. Rehg, Sangmin Lee

Abstract: Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the background or another object. We identify this as the failure to correctly capture objectness properties in learning to track. To address this limitation of prior work, we propose a novel objectness regularization approach that guides points to be aware of object priors by forcing them to stay inside the boundaries of object instances. By capturing objectness cues at training time, we avoid the need to compute object masks during testing. In addition, we leverage contextual attention to enhance the feature representation for capturing objectness at the feature level more effectively. As a result, our approach achieves state-of-the-art performance on three point tracking benchmarks, and we further validate the effectiveness of our components via ablation studies. The source code is available at: https://github.com/RehgLab/tracking_objectness

* ECCV 2024 ILR Workshop
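
The idea of keeping tracked points inside object boundaries can be illustrated with a toy penalty. This is only a sketch under stated assumptions, not the paper's loss: the function name `outside_mask_penalty`, the binary-mask input, and the nearest-pixel rounding are all hypothetical, and a real training objective would need to be differentiable.

```python
def outside_mask_penalty(points, mask):
    """Fraction of predicted points that fall outside a binary object mask.

    points: list of (x, y) coordinates (hypothetical tracker output).
    mask:   2D list of 0/1 values indexed as mask[row][col].
    """
    height, width = len(mask), len(mask[0])
    outside = 0
    for x, y in points:
        # Snap each point to its nearest pixel (an illustrative simplification).
        col, row = int(round(x)), int(round(y))
        in_bounds = 0 <= row < height and 0 <= col < width
        if not (in_bounds and mask[row][col] == 1):
            outside += 1
    return outside / len(points)


# Toy 3x3 mask whose only object pixel is the center.
toy_mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
penalty = outside_mask_penalty([(1, 1), (0, 0)], toy_mask)
```

In this toy case one of the two points lands on the background, so half of the points are penalized. The mask is only consulted when computing the penalty, which mirrors the abstract's point that object masks are not needed at test time.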

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

Jul 12, 2024

Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, James M. Rehg

Abstract: 3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, that achieves SOTA performance on different benchmarks with various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. Our method, 3-By-2, can accommodate various part taxonomies and granularities, demonstrating interesting part label transfer ability across different object categories. Project website: https://ngailapdi.github.io/projects/3by2/

* Accepted to ECCV 2024

The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

Jun 14, 2024

Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman (+2 more)

Abstract: Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience, their "training data", is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date, the BabyView dataset, recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493-hour dataset includes egocentric videos from children spanning 6 months to 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human levels of success on the same scale and distribution of training data as humans?

* 9 pages, 2 figures, 4 tables and SI. Submitted to NeurIPS Datasets and Benchmarks

ZeroShape: Regression-based Zero-shot Shape Reconstruction

Jan 16, 2024

Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg

Abstract: We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such regression methods possess much higher computational efficiency than generative methods. This raises a natural question: is generative modeling necessary for high performance, or conversely, are regression-based approaches still competitive? To answer this, we design a strong regression-based model, called ZeroShape, based on the converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark, with objects from three different real-world 3D datasets. This evaluation benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models, aiming at reducing the evaluation variance in our field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods, but also demonstrates significantly higher computational and data efficiency.

* Project page: https://zixuanh.com/projects/zeroshape.html

Low-shot Object Learning with Mutual Exclusivity Bias

Dec 06, 2023

Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

Abstract: This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias, a phenomenon commonly observed in infants during word learning. We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task. The goal of LSME is to analyze an RGB image of a scene containing multiple objects and correctly associate a previously unknown object instance with a provided category label. This association is then used to perform low-shot learning to test category generalization. We provide a data generation pipeline for the LSME problem and conduct a thorough analysis of the factors that contribute to its difficulty. Additionally, we evaluate the performance of multiple baselines, including state-of-the-art foundation models. Finally, we present a baseline approach that outperforms state-of-the-art models in terms of low-shot accuracy.

* Accepted at NeurIPS 2023, Datasets and Benchmarks Track. Project website: https://ngailapdi.github.io/projects/lsme/

ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency

Apr 13, 2023

Zixuan Huang, Varun Jampani, Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg

Abstract: We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP encodings to share similar shapes. We also leverage off-the-shelf normals as an additional geometric constraint so the model can learn better bottom-up reasoning of detailed surface geometry. These two novel consistency constraints, when used to regularize our model, improve its ability to learn both global shape structure and local geometric details. We evaluate our method over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages, where we achieve superior performance over state-of-the-art methods.

* Accepted to CVPR 2023, project website at https://zixuanh.com/projects/shapeclipper.html

Learning Dense Object Descriptors from Multiple Views for Low-shot Category Generalization

Nov 28, 2022

Stefan Stojanov, Anh Thai, Zixuan Huang, James M. Rehg

Abstract: A hallmark of the deep learning era for computer vision is the successful use of large-scale labeled datasets to train feature representations for tasks ranging from object recognition and semantic segmentation to optical flow estimation and novel view synthesis of 3D scenes. In this work, we aim to learn dense discriminative object representations for low-shot category recognition without requiring any category labels. To this end, we propose Deep Object Patch Encodings (DOPE), which can be trained from multiple views of object instances without any category or semantic object part labels. To train DOPE, we assume access to sparse depths, foreground masks and known cameras, to obtain pixel-level correspondences between views of an object, and use this to formulate a self-supervised learning task to learn discriminative object patches. We find that DOPE can directly be used for low-shot classification of novel categories using local-part matching, and is competitive with and outperforms supervised and self-supervised learning baselines. Code and data available at https://github.com/rehg-lab/dope_selfsup.

* Accepted at NeurIPS 2022. Code and data available at https://github.com/rehg-lab/dope_selfsup
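
Learning descriptors from pixel-level correspondences between views is commonly done with a contrastive objective. The sketch below shows a generic InfoNCE-style loss over corresponding pixel descriptors. It is an illustrative assumption, not the DOPE training code: the function name `pixel_info_nce`, the temperature value, and the use of all other pixels in the second view as negatives are hypothetical choices.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def pixel_info_nce(desc_a, desc_b, temperature=0.1):
    """InfoNCE-style loss over corresponding pixel descriptors.

    desc_a[i] and desc_b[i] are descriptors of the same surface point seen
    in two views; every desc_b[j] with j != i serves as a negative.
    """
    a = [normalize(d) for d in desc_a]
    b = [normalize(d) for d in desc_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, cand) / temperature for cand in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true correspondence.
        total += log_denom - logits[i]
    return total / len(a)


view_a = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = pixel_info_nce(view_a, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = pixel_info_nce(view_a, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when corresponding pixels have matching descriptors and large when the matches are swapped, which is the behavior a correspondence-driven objective relies on.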
