Abstract:Prompt learning for vision-language models, e.g., CoOp, has shown great success in adapting CLIP to different downstream tasks, making it a promising solution for federated learning due to its low computational cost. Existing prompt learning techniques replace hand-crafted text prompts with learned vectors that improve performance on seen classes but struggle to generalize to unseen classes. Our work addresses this challenge by proposing Federated Text-driven Prompt Generation (FedTPG), which learns a unified prompt generation network across multiple remote clients in a scalable manner. The prompt generation network is conditioned on task-related text input and is thus context-aware, enabling it to generalize to both seen and unseen classes. Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods, achieving better overall generalization on both seen and unseen classes as well as on unseen datasets.
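The abstract leaves the generator's form unspecified; below is a minimal PyTorch sketch of what a text-conditioned prompt generation network could look like, mapping embeddings of a task's class names to soft prompt vectors. In a federated setting, only this small module's weights would be shared and averaged across clients (e.g., with FedAvg). The architecture, dimensions, and mean-pooling over class names are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TextConditionedPromptGenerator(nn.Module):
    """Sketch of a text-conditioned prompt generator (illustration only, not the
    paper's exact architecture): embeddings of the task's class names are pooled
    into a task context and mapped to soft prompt vectors for CLIP's text encoder."""

    def __init__(self, embed_dim: int = 512, n_prompt_tokens: int = 4, hidden_dim: int = 256):
        super().__init__()
        self.n_prompt_tokens = n_prompt_tokens
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_prompt_tokens * embed_dim),
        )

    def forward(self, class_name_embs: torch.Tensor) -> torch.Tensor:
        # class_name_embs: (C, embed_dim) CLIP text embeddings of the client's class names
        task_context = class_name_embs.mean(dim=0)            # summarize the task
        prompts = self.net(task_context)
        return prompts.view(self.n_prompt_tokens, -1)         # (n_prompt_tokens, embed_dim)
```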
Abstract:Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates to templates obtained from a large language model and templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged: classify an image as the class that maximizes the cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues in a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights for each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has very low computational overhead, and can be easily implemented in a few lines of code. We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percentage points in accuracy.
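As a rough illustration of the per-image weighting idea (the exact AutoCLIP statistic and weight-update rule are not spelled out in the abstract, so the logsumexp aggregation and the softmax sharpness below are assumptions):

```python
import torch
import torch.nn.functional as F

def weighted_zero_shot_logits(image_emb: torch.Tensor,
                              descriptor_embs: torch.Tensor,
                              beta: float = 1.0) -> torch.Tensor:
    """Sketch of per-image prompt-template weighting (not the exact AutoCLIP rule).

    image_emb:       (d,)      L2-normalized image embedding
    descriptor_embs: (T, C, d) L2-normalized text embeddings, T templates x C classes
    beta:            sharpness of the template weights (hypothetical choice)
    """
    # Similarity of the image to every encoded class descriptor: (T, C)
    sims = torch.einsum("tcd,d->tc", descriptor_embs, image_emb)
    # Per-template statistic: how well this template's descriptors match the image.
    template_score = torch.logsumexp(100.0 * sims, dim=1)     # (T,)
    # Turn the statistics into per-image weights over templates.
    weights = F.softmax(beta * template_score, dim=0)         # (T,)
    # Weighted average over templates instead of a uniform mean.
    return torch.einsum("t,tc->c", weights, sims)             # (C,) class logits
```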
Abstract:Pretrained vision-language models, such as CLIP, show promising zero-shot performance across a wide variety of datasets. For closed-set classification tasks, however, there is an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features, which can summarize superfluous or confounding information for the target task. This degrades classification performance, especially when objects of interest cover only small areas of the input images. In this work, we propose CLIP with Guided Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection model in a preprocessing step to increase the focus of the zero-shot classifier on the object of interest and to minimize the influence of extraneous image regions. We empirically show that our approach improves zero-shot classification results across architectures and datasets, particularly for small objects.
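A minimal sketch of the guided-cropping pipeline; `detect_box` and `clip_classify` are hypothetical caller-supplied wrappers around an off-the-shelf zero-shot detector and a standard CLIP zero-shot classifier, and the box margin is an assumed detail:

```python
from PIL import Image

def guided_crop_classify(image: Image.Image, class_names, detect_box, clip_classify,
                         margin: float = 0.1):
    """Sketch of guided cropping (placeholders, not the paper's exact pipeline).

    detect_box(image, queries)        -> (x0, y0, x1, y1) from a zero-shot detector
    clip_classify(image, class_names) -> predicted class from a CLIP classifier
    """
    # 1) Localize the object of interest with the zero-shot detector.
    x0, y0, x1, y1 = detect_box(image, class_names)

    # 2) Enlarge the box slightly so some context around the object is kept.
    w, h = image.size
    dx, dy = margin * (x1 - x0), margin * (y1 - y0)
    box = tuple(int(round(v)) for v in
                (max(0, x0 - dx), max(0, y0 - dy), min(w, x1 + dx), min(h, y1 + dy)))

    # 3) Classify the crop instead of the full image.
    return clip_classify(image.crop(box), class_names)
```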
Abstract:CLIP, as a foundational vision-language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that, in classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
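A simplified sketch of the two-step procedure, assuming a generic `encode_text` wrapper around CLIP's text encoder and a hypothetical background-attribute vocabulary; the prompt wording and the hard (argmax) attribute choice are illustrative simplifications rather than the paper's exact conditioning scheme:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def attribute_conditioned_classify(image_emb, encode_text, class_names,
                                   attribute_values=("on land", "on water")):
    """Sketch of attribute-conditioned zero-shot classification.

    image_emb:        (d,) L2-normalized CLIP image embedding
    encode_text:      maps a list of strings to L2-normalized (n, d) embeddings
    attribute_values: hypothetical contextual-attribute vocabulary (e.g., background)
    """
    # Step 1: infer the contextual attribute from the image itself.
    attr_embs = encode_text([f"a photo of something {a}" for a in attribute_values])
    attribute = attribute_values[(attr_embs @ image_emb).argmax().item()]

    # Step 2: classify conditioned on the inferred attribute.
    class_embs = encode_text([f"a photo of a {c} {attribute}" for c in class_names])
    probs = F.softmax(100.0 * class_embs @ image_emb, dim=0)
    return class_names[probs.argmax().item()], probs
```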
Abstract:Certifiably robust defenses against adversarial patches for image classifiers ensure correct prediction against any change within a constrained neighborhood of pixels. PatchCleanser (arXiv:2108.09135 [cs.CV]), the state-of-the-art certified defense, uses a double-masking strategy for robust classification. The success of this strategy relies heavily on the model's invariance to image pixel masking. In this paper, we take a closer look at model training schemes to improve this invariance. Instead of using Random Cutout (arXiv:1708.04552v2 [cs.CV]) augmentations like PatchCleanser, we introduce the notion of worst-case masking, i.e., selecting masked images that maximize classification loss. However, finding worst-case masks requires an exhaustive search, which might be prohibitively expensive to do on the fly during training. To solve this problem, we propose a two-round greedy masking strategy (Greedy Cutout) that finds an approximate worst-case mask location with much less compute. We show that models trained with our Greedy Cutout improve certified robust accuracy over Random Cutout in PatchCleanser across a range of datasets and architectures. Certified robust accuracy on ImageNet with a ViT-B16-224 model increases from 58.1% to 62.3% against a 3% square patch applied anywhere on the image.
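A rough PyTorch sketch of a two-round greedy mask search; the grid sizes, strides, zero-filling, and refinement scheme are assumptions for illustration and may differ from the paper's exact Greedy Cutout procedure:

```python
import torch
import torch.nn.functional as F

def greedy_cutout_location(model, image, label, mask_size=64,
                           coarse_stride=64, fine_stride=16):
    """Approximate a worst-case (high-loss) mask location with two greedy rounds.

    image: (1, C, H, W) tensor; label: (1,) tensor with the true class index.
    """
    _, _, H, W = image.shape

    def loss_with_mask(y, x):
        masked = image.clone()
        masked[:, :, y:y + mask_size, x:x + mask_size] = 0.0   # zero-fill the mask
        with torch.no_grad():
            return F.cross_entropy(model(masked), label).item()

    # Round 1: coarse grid search over candidate mask locations.
    coarse = [(loss_with_mask(y, x), y, x)
              for y in range(0, H - mask_size + 1, coarse_stride)
              for x in range(0, W - mask_size + 1, coarse_stride)]
    _, by, bx = max(coarse)

    # Round 2: finer search in the neighborhood of the best coarse location.
    fine = [(loss_with_mask(y, x), y, x)
            for y in range(max(0, by - coarse_stride), min(H - mask_size, by + coarse_stride) + 1, fine_stride)
            for x in range(max(0, bx - coarse_stride), min(W - mask_size, bx + coarse_stride) + 1, fine_stride)]
    _, by, bx = max(fine)
    return by, bx   # train on the image masked at (by, bx)
```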
Abstract:Open Set Recognition (OSR) extends image classification to an open-world setting by simultaneously classifying known classes and identifying unknown ones. While conventional OSR approaches can detect Out-of-Distribution (OOD) samples, they cannot provide explanations indicating which underlying visual attribute(s) (e.g., shape, color, or background) cause a specific sample to be unknown. In this work, we introduce a novel problem setup that generalizes conventional OSR to a multi-attribute setting, where multiple visual attributes are simultaneously recognized. Here, OOD samples can be not only identified but also categorized by their unknown attribute(s). We propose simple extensions of common OSR baselines to handle this novel scenario. We show that these baselines are vulnerable to shortcuts when spurious correlations exist in the training dataset. This leads to poor OOD performance which, according to our experiments, is mainly due to unintended cross-attribute correlations of the predicted confidence scores. We provide empirical evidence showing that this behavior is consistent across different baselines on both synthetic and real-world datasets.
Abstract:Shortcut learning occurs when a deep neural network overly relies on spurious correlations in the training dataset in order to solve downstream tasks. Prior works have shown how this impairs the compositional generalization capability of deep learning models. To address this problem, we propose a novel approach to mitigate shortcut learning in uncontrolled target domains. Our approach extends the training set with an additional dataset (the source domain), which is specifically designed to facilitate learning independent representations of basic visual factors. We benchmark our idea on synthetic target domains, where we explicitly control shortcut opportunities, as well as on real-world target domains. Furthermore, we analyze the effect of different specifications of the source domain and the network architecture on compositional generalization. Our main finding is that leveraging data from a source domain is an effective way to mitigate shortcut learning. By promoting independence across different factors of variation in the learned representations, networks can learn to consider only predictive factors and ignore potential shortcut factors during inference.
Abstract:Neural architectures based on attention, such as vision transformers, are revolutionizing image recognition. Their main benefit is that attention allows reasoning about all parts of a scene jointly. In this paper, we show how the global reasoning of (scaled) dot-product attention can be the source of a major vulnerability when confronted with adversarial patch attacks. We provide a theoretical understanding of this vulnerability and relate it to an adversary's ability to misdirect the attention of all queries to a single key token under the control of the adversarial patch. We propose novel adversarial objectives for crafting adversarial patches which target this vulnerability explicitly. We show the effectiveness of the proposed patch attacks on popular image classification models (ViTs and DeiTs) and object detection models (DETR). We find that adversarial patches occupying 0.5% of the input can lead to robust accuracies as low as 0% for ViT on ImageNet, and reduce the mAP of DETR on MS COCO to less than 3%.
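One plausible way to express such an attention-misdirection objective in PyTorch (an illustration, not necessarily the paper's exact loss); the patch pixels would then be optimized by gradient descent on this loss so that the attention mass all queries assign to the patch's key tokens grows:

```python
import torch

def attention_misdirection_loss(attn_weights: torch.Tensor,
                                patch_token_idx: torch.Tensor) -> torch.Tensor:
    """Sketch of an adversarial objective rewarding attention directed to patch tokens.

    attn_weights:    (B, heads, Q, K) softmaxed attention of one transformer layer
    patch_token_idx: indices of the key tokens covered by the adversarial patch
    """
    # Total attention mass each query assigns to the patch tokens.
    mass_on_patch = attn_weights[..., patch_token_idx].sum(dim=-1)   # (B, heads, Q)
    # Minimizing this loss (i.e., maximizing the mass) misdirects all queries
    # toward the key tokens under the adversary's control.
    return -mass_on_patch.mean()
```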
Abstract:Common deep neural networks (DNNs) for image classification have been shown to rely on shortcut opportunities (SO) in the form of predictive and easy-to-represent visual factors. This is known as shortcut learning and leads to impaired generalization. In this work, we show that common DNNs also suffer from shortcut learning when predicting only basic visual object factors of variation (FoV) such as shape, color, or texture. We argue that, besides shortcut opportunities, generalization opportunities (GO) are also an inherent part of real-world vision data and arise from partial independence between predicted classes and FoVs. We also argue that it is necessary for DNNs to exploit GO to overcome shortcut learning. Our core contribution is to introduce the Diagnostic Vision Benchmark suite DiagViB-6, which includes datasets and metrics to study a network's shortcut vulnerability and generalization capability for six independent FoVs. In particular, DiagViB-6 allows controlling the type and degree of SO and GO in a dataset. We benchmark a wide range of popular vision architectures and show that they can exploit GO only to a limited extent.
Abstract:Deep neural networks often exhibit poor performance on data that is unlikely under the train-time data distribution, for instance data affected by corruptions. Previous works demonstrate that test-time adaptation to data shift, for instance using entropy minimization, effectively improves performance on such shifted distributions. This paper focuses on the fully test-time adaptation setting, where only unlabeled data from the target distribution is required. This allows adapting arbitrary pretrained networks. Specifically, we propose a novel loss that improves test-time adaptation by addressing both premature convergence and instability of entropy minimization. This is achieved by replacing the entropy by a non-saturating surrogate and adding a diversity regularizer based on batch-wise entropy maximization that prevents convergence to trivial collapsed solutions. Moreover, we propose to prepend an input transformation module to the network that can partially undo test-time distribution shifts. Surprisingly, this preprocessing can be learned solely using the fully test-time adaptation loss in an end-to-end fashion without any target domain labels or source domain data. We show that our approach outperforms previous work in improving the robustness of publicly available pretrained image classifiers to common corruptions on such challenging benchmarks as ImageNet-C.
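A simplified stand-in for the described loss, combining a non-saturating per-sample confidence term with a batch-wise diversity regularizer; the specific surrogate used here (negative log of the top class probability) is a hypothetical choice for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def test_time_adaptation_loss(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a fully test-time adaptation objective on unlabeled target data.

    logits: (B, C) predictions of the network being adapted.
    """
    probs = F.softmax(logits, dim=1)                                  # (B, C)

    # Non-saturating confidence term: unlike plain entropy, its gradient does not
    # vanish as predictions become confident (illustrative surrogate).
    confidence_term = -(probs.max(dim=1).values + eps).log().mean()

    # Diversity regularizer: maximize the entropy of the batch-averaged prediction
    # to prevent collapse onto a single trivial class.
    mean_probs = probs.mean(dim=0)                                    # (C,)
    batch_entropy = -(mean_probs * (mean_probs + eps).log()).sum()

    return confidence_term - batch_entropy
```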