Amit Roy-Chowdhury

When and How Unlabeled Data Provably Improve In-Context Learning

Jun 18, 2025

Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury, Samet Oymak

Abstract:Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of the theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves semi-supervised tabular learning performance over standard single-pass inference.
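
The estimator form named in this abstract, $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$, can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the paper's construction: the coefficients `coeffs`, the helper names, the mixture mean `mu`, and the 80% masking rate are all hypothetical choices made here. The point it shows is that the $i = 0$ term uses only labeled rows (unlabeled entries of `y` are zero), while higher powers of $X^\top X$ bring in every row, labeled or not.

```python
import numpy as np

def polynomial_estimator(X, y, coeffs):
    """Return sum_i coeffs[i] * (X^T X)^i X^T y.

    X: (n, d) features; y: (n,) labels with missing entries set to 0.
    coeffs: illustrative coefficients a_0, a_1, ... (hypothetical values).
    """
    term = X.T @ y            # i = 0 term: only labeled rows contribute
    gram = X.T @ X            # built from all rows, labeled or not
    w = np.zeros_like(term, dtype=float)
    for a in coeffs:
        w += a * term
        term = gram @ term    # advance to (X^T X)^(i+1) X^T y
    return w

def make_gmm(n, d, mu, rng):
    """Sample a binary Gaussian mixture: x = label * mu + standard noise."""
    labels = rng.choice([-1.0, 1.0], size=n)
    X = labels[:, None] * mu + rng.standard_normal((n, d))
    return X, labels

rng = np.random.default_rng(0)
n, d = 200, 5
mu = np.ones(d)                    # assumed mixture mean
X, labels = make_gmm(n, d, mu, rng)

y = labels.copy()
y[rng.random(n) < 0.8] = 0.0       # hide roughly 80% of the labels

w_sup = polynomial_estimator(X, y, [1.0])            # labeled data only
w_poly = polynomial_estimator(X, y, [1.0, 1.0 / n])  # adds one Gram power
```

With `coeffs = [1.0]` the function reduces to the plain correlation `X.T @ y`; longer coefficient lists add terms that depend on the unlabeled features through the Gram matrix.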

A Certified Unlearning Approach without Access to Source Data

Jun 06, 2025

Umit Yigit Basaran, Sk Miraj Ahmed, Amit Roy-Chowdhury, Basak Guler

Abstract:With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.

* Accepted by ICML 2025

Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning

Feb 03, 2025

Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury

Abstract:Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2× fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.

Selective Attention: Enhancing Transformer through Principled Context Control

Nov 19, 2024

Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

Abstract:The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^\top\text{softmax}(Kq)$, where $V,K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the $\textit{Selective Self-Attention}$ (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. We also incorporate temperature scaling for value embeddings and show that it boosts the model's ability to suppress irrelevant/noisy tokens. Notably, SSA is a lightweight method which introduces less than 0.5% new parameters through a weight-sharing strategy and can be fine-tuned on existing LLMs. Extensive empirical evaluations demonstrate that SSA-equipped models achieve a noticeable and consistent accuracy improvement on language modeling benchmarks.
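
The mapping $V^\top\text{softmax}(Kq)$ and the effect of a softmax temperature can be made concrete with a short NumPy sketch. It is a minimal illustration, not the SSA layer itself: the abstract does not specify how the temperature is computed from the query or its position, so the fixed `temperature` argument, the function names, and the random `K`, `V`, `q` below are assumptions made only for this example. It shows that a lower temperature yields a spikier (lower-entropy) attention map over the same keys.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attend(V, K, q, temperature=1.0):
    """Compute V^T softmax(K q / temperature) for a single query q."""
    weights = softmax((K @ q) / temperature)
    return V.T @ weights, weights

def entropy(p):
    """Shannon entropy of an attention distribution (natural log)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
num_tokens, dim = 8, 4
K = rng.standard_normal((num_tokens, dim))   # key embeddings
V = rng.standard_normal((num_tokens, dim))   # value embeddings
q = rng.standard_normal(dim)                 # one query embedding

out_std, w_std = attend(V, K, q, temperature=1.0)
out_sharp, w_sharp = attend(V, K, q, temperature=0.25)  # hypothetical value

print("entropy at temperature 1.0 :", entropy(w_std))
print("entropy at temperature 0.25:", entropy(w_sharp))
```

In a query-dependent scheme, `temperature` would be produced by a learned function of the query rather than passed in as a constant.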

Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior

Oct 18, 2024

Calvin-Khang Ta, Arindam Dutta, Rohit Kundu, Rohit Lal, Hannah Dela Cruz, Dripta S. Raychaudhuri, Amit Roy-Chowdhury

Abstract:The Skinned Multi-Person Linear (SMPL) model plays a crucial role in 3D human pose estimation, providing a streamlined yet effective representation of the human body. However, ensuring the validity of SMPL configurations during tasks such as human mesh regression remains a significant challenge, highlighting the necessity for a robust human pose prior capable of discerning realistic human poses. To address this, we introduce MOPED: Multi-mOdal PosE Diffuser. MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters. Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text. This capability enhances the applicability of our approach by incorporating additional context often overlooked in traditional pose priors. Extensive experiments across three distinct tasks (pose estimation, pose denoising, and pose completion) demonstrate that our multi-modal diffusion model-based prior significantly outperforms existing methods. These results indicate that our model captures a broader spectrum of plausible human poses.

Vision-based Xylem Wetness Classification in Stem Water Potential Determination

Sep 24, 2024

Pamodya Peiris, Aritra Samanta, Caio Mucchiani, Cody Simons, Amit Roy-Chowdhury, Konstantinos Karydis

Abstract:Water is often overused in irrigation, making efficient management of it crucial. Precision Agriculture emphasizes tools like stem water potential (SWP) analysis for better plant status determination. However, such tools often require labor-intensive in-situ sampling. Automation and machine learning can streamline this process and enhance outcomes. This work focused on automating stem detection and xylem wetness classification using the Scholander Pressure Chamber, a widely used but demanding method for SWP measurement. The aim was to refine stem detection and develop computer-vision-based methods to better classify water emergence at the xylem. To this end, we collected and manually annotated video data, applying vision- and learning-based methods for detection and classification. Additionally, we explored data augmentation and fine-tuned parameters to identify the most effective models. The identified best-performing models for stem detection and xylem wetness classification were evaluated end-to-end over 20 SWP measurements. Learning-based stem detection via YOLOv8n combined with ResNet50-based classification achieved a Top-1 accuracy of 80.98%, making it the best-performing approach for xylem wetness classification.

Efficient Transformer Encoders for Mask2Former-style models

Apr 23, 2024

Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

Abstract:Vision transformer-based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs highly resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability for providing a balance between performance and computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computational-accuracy tradeoff, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on the public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
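
Step two of the recipe, building a derived dataset of the ideal number of encoder layers per training example, might look roughly like the sketch below. The abstract does not state the selection criterion, so everything here is an assumption for illustration: the function name `ideal_num_layers`, the rule "smallest exit depth whose quality is within `tolerance` of the best exit", and the made-up per-exit quality scores. The sketch only shows why changing the tradeoff requires redoing the labeling (and then retraining the gating network) without retraining the early-exit parent model.

```python
from typing import Sequence

def ideal_num_layers(quality_per_exit: Sequence[float], tolerance: float) -> int:
    """Pick the smallest exit depth whose quality is within `tolerance`
    of the best quality over all exits (hypothetical criterion).

    quality_per_exit[k] is the quality obtained when exiting after k + 1
    encoder layers of an already trained early-exit model.
    """
    if not quality_per_exit:
        raise ValueError("need at least one exit")
    best = max(quality_per_exit)
    for depth, quality in enumerate(quality_per_exit, start=1):
        if quality >= best - tolerance:
            return depth
    raise AssertionError("unreachable: the best exit always qualifies")

# Made-up per-exit quality scores for three training images (4 exits each).
scores = [
    [0.61, 0.70, 0.74, 0.75],
    [0.80, 0.81, 0.81, 0.81],
    [0.30, 0.42, 0.55, 0.68],
]

# Two derived label sets: a strict one and one that trades accuracy for compute.
strict_labels = [ideal_num_layers(s, tolerance=0.0) for s in scores]
relaxed_labels = [ideal_num_layers(s, tolerance=0.02) for s in scores]

print(strict_labels)
print(relaxed_labels)
```

A gating network in step three would then be trained to predict these per-image labels from the input image.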

Blackbox Attacks via Surrogate Ensemble Search

Aug 07, 2022

Zikui Cai, Chengyu Song, Srikanth Krishnamurthy, Amit Roy-Chowdhury, M. Salman Asif

Abstract:Blackbox adversarial attacks can be categorized into transfer- and query-based attacks. Transfer methods do not require any feedback from the victim model, but provide lower success rates compared to query-based methods. Query attacks often require a large number of queries for success. To achieve the best of both approaches, recent efforts have tried to combine them, but still require hundreds of queries to achieve high success rates (especially for targeted attacks). In this paper, we propose a novel method for blackbox attacks via surrogate ensemble search (BASES) that can generate highly successful blackbox attacks using an extremely small number of queries. We first define a perturbation machine that generates a perturbed image by minimizing a weighted loss function over a fixed set of surrogate models. To generate an attack for a given victim model, we search over the weights in the loss function using queries generated by the perturbation machine. Since the dimension of the search space is small (same as the number of surrogate models), the search requires a small number of queries. We demonstrate that our proposed method achieves a better success rate with at least 30x fewer queries compared to state-of-the-art methods on different image classifiers trained with ImageNet (including VGG-19, DenseNet-121, and ResNeXt-50). In particular, our method requires as few as 3 queries per image (on average) to achieve more than a 90% success rate for targeted attacks and 1-2 queries per image for over a 99% success rate for non-targeted attacks. Our method is also effective on Google Cloud Vision API and achieved a 91% non-targeted attack success rate with 2.9 queries per image. We also show that the perturbations generated by our proposed method are highly transferable and can be adopted for hard-label blackbox attacks.

Connecting the Dots: Detecting Adversarial Perturbations Using Context Inconsistency

Jul 24, 2020

Shasha Li, Shitong Zhu, Sudipta Paul, Amit Roy-Chowdhury, Chengyu Song, Srikanth Krishnamurthy, Ananthram Swami, Kevin S Chan

Abstract:There has been a recent surge in research on adversarial perturbations that defeat Deep Neural Networks (DNNs) in machine vision; most of these perturbation-based attacks target object classifiers. Inspired by the observation that humans are able to recognize objects that appear out of place in a scene or along with other unlikely objects, we augment the DNN with a system that learns context consistency rules during training and checks for the violations of the same during testing. Our approach builds a set of auto-encoders, one for each object class, appropriately trained so as to output a discrepancy between the input and output if an added adversarial perturbation violates context consistency rules. Experiments on PASCAL VOC and MS COCO show that our method effectively detects various adversarial attacks and achieves high ROC-AUC (over 0.95 in most cases); this corresponds to over 20% improvement over a state-of-the-art context-agnostic method.

* The paper is accepted by ECCV 2020

Adaptive Algorithm and Platform Selection for Visual Detection and Tracking

May 21, 2016

Shu Zhang, Qi Zhu, Amit Roy-Chowdhury

Abstract:Computer vision algorithms are known to be extremely sensitive to the environmental conditions in which the data is captured, e.g., lighting conditions and target density. Tuning of parameters or choosing a completely new algorithm is often needed to achieve a certain performance level, especially when computational resources are limited. In this paper, we focus on this problem and propose a framework to adaptively select the "best" algorithm-parameter combination and the computation platform under performance and cost constraints at design time, and adapt the algorithms at runtime based on real-time inputs. This necessitates developing a mechanism to switch between different algorithms as the nature of the input video changes. Our proposed algorithm calculates a similarity function between a test video scenario and each training scenario, where the similarity calculation is based on learning a manifold of image features that is shared by both the training and test datasets. Similarity between the training and test datasets indicates that the same algorithm can be applied to both of them and achieve similar performance. We design a cost function with this similarity measure to find the most similar training scenario to the test data. The "best" algorithm under a given platform is obtained by selecting the algorithm with a specific parameter combination that performs the best on the corresponding training data. The proposed framework can be used first offline to choose the platform based on performance and cost constraints, and then online whereby the "best" algorithm is selected for each new incoming video segment for a given platform. In the experiments, we apply our algorithm to the problems of pedestrian detection and tracking. We show how to adaptively select platforms and algorithm-parameter combinations. Our results provide optimal performance on 3 publicly available datasets.

* 10 pages, 10 figures
