Abstract: Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and the class "face" in another will require multi-label handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach that integrates language-based embeddings of class names and label-space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.
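A minimal sketch of the language-embedding classification idea above, assuming class-name embeddings come from a frozen text encoder (e.g., CLIP) and query features come from a Mask2Former-style decoder; the cosine-similarity classifier and the single set of query embeddings shown here are illustrative simplifications, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedClassifier(nn.Module):
    """Classify object queries against text embeddings of class names,
    so label spaces from different datasets share one semantic space."""

    def __init__(self, class_embeddings: torch.Tensor, num_queries: int, dim: int):
        super().__init__()
        # class_embeddings: (num_classes, dim), e.g. from a frozen text encoder
        self.register_buffer("class_emb", F.normalize(class_embeddings, dim=-1))
        # label-space-specific query embeddings (one set per training dataset
        # would be kept in practice; a single set is shown here for brevity)
        self.query_emb = nn.Parameter(torch.randn(num_queries, dim))
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, query_features: torch.Tensor) -> torch.Tensor:
        # query_features: (batch, num_queries, dim) from the transformer decoder
        q = F.normalize(query_features + self.query_emb, dim=-1)
        # cosine similarity against every class-name embedding -> class logits
        return self.logit_scale * q @ self.class_emb.t()

# usage: embeddings of class names like "person", "face", ... come from a text encoder
num_classes, dim = 5, 256
classifier = LanguageGuidedClassifier(torch.randn(num_classes, dim), num_queries=100, dim=dim)
logits = classifier(torch.randn(2, 100, dim))   # (2, 100, num_classes)
```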
Abstract: Vision transformer-based models bring significant improvements to image segmentation tasks. Although these architectures offer powerful capabilities irrespective of the specific segmentation task, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is to adapt the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F, or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability, which balances performance and computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use this derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computation-accuracy trade-off, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
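A rough sketch of step three of the recipe, assuming a small MLP gating network that reads a pooled summary of the input tokens and picks how many encoder layers to run; module names and sizes are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Run only as many encoder layers as a gating network predicts
    for the given input (a simplified sketch of the ECO-M2F idea)."""

    def __init__(self, d_model: int = 256, num_layers: int = 6, nhead: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # gating network: pools the tokens and predicts one of `num_layers` exits
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_model // 2), nn.ReLU(),
            nn.Linear(d_model // 2, num_layers),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) flattened multi-scale features
        # predict the exit depth from a cheap global summary of the input
        exit_layer = int(self.gate(tokens.mean(dim=1)).argmax(dim=-1).max()) + 1
        for layer in self.layers[:exit_layer]:
            tokens = layer(tokens)
        return tokens

encoder = GatedEncoder()
out = encoder(torch.randn(2, 1024, 256))  # encoder depth chosen at runtime
```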
Abstract: A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observe that the state-of-the-art method Mask2Former spends ~50% of its compute on the transformer encoder alone. This is due to the retention of a full-length, token-level representation of all backbone feature scales at each encoder layer. With this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged into Mask2Former-style segmentation architectures to significantly reduce the computational cost. The underlying principle of PRO-SCALE is to progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance (~52% GFLOPs reduction with no drop in performance on the COCO dataset). We validate our framework on multiple public benchmarks.
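The progressive-token-length principle could be sketched as below, under the assumption that early encoder layers see only the coarsest backbone scale and finer scales are appended at later layers; this is an interpretation of the stated principle, not the exact PRO-SCALE schedule.

```python
import torch
import torch.nn as nn

class ProgressiveScaleEncoder(nn.Module):
    """Grow the token sequence with encoder depth: start from the coarsest
    backbone scale and append finer scales at later layers (illustrative only)."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_scales: int = 3):
        super().__init__()
        # one encoder layer per stage; each stage sees one more scale than the last
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_scales)
        )

    def forward(self, scales: list[torch.Tensor]) -> torch.Tensor:
        # scales: coarsest-to-finest token tensors, each of shape (batch, len_i, d_model)
        tokens = scales[0]
        for i, stage in enumerate(self.stages):
            tokens = stage(tokens)                                  # cheap while tokens are few
            if i + 1 < len(scales):
                tokens = torch.cat([tokens, scales[i + 1]], dim=1)  # append the next finer scale
        return tokens

enc = ProgressiveScaleEncoder()
feats = [torch.randn(2, n, 256) for n in (64, 256, 1024)]  # coarse -> fine
out = enc(feats)  # (2, 64 + 256 + 1024, 256)
```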
Abstract: We aim to train a multi-task model such that users can adjust the desired compute budget and the relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models for various scenarios. To this end, we propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable. Our key idea is to control the task importance by varying the capacities of the task-specific decoders, while controlling the total computational cost by jointly adjusting the encoder capacity. This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures based on the user's constraints. Our training strategy involves a novel 'Configuration-Invariant Knowledge Distillation' loss that enforces backbone representations to be invariant under different runtime width configurations to enhance accuracy. Further, we present a simple but effective search algorithm that translates user constraints into runtime width configurations of both the shared encoder and the task decoders, for sampling the sub-architectures. The key rule of the search algorithm is to give a larger computational budget to the decoder of the more preferred task, while searching for a shared encoder configuration that enhances the overall MTL performance. Experiments on three multi-task benchmarks (PASCAL-Context, NYUD-v2, and CIFAR100-MTL) with diverse backbone architectures demonstrate the advantage of our approach. For example, our method improves controllability by ~33.5% on the NYUD-v2 dataset over prior methods, while incurring much less compute cost.
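A toy sketch of the slimmable-width idea together with a configuration-invariant distillation term, assuming a hypothetical `SlimmableLinear` layer and a simple channel-sliced feature-matching loss; the paper's actual slimming scheme and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Linear):
    """A linear layer whose active output width can be shrunk at runtime."""

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        out_ch = max(1, int(self.out_features * width))
        return F.linear(x, self.weight[:out_ch], self.bias[:out_ch])

def config_invariant_kd(full_feat: torch.Tensor, slim_feat: torch.Tensor) -> torch.Tensor:
    """Encourage slimmed-width backbone features to match full-width features;
    the full-width features act as the (detached) teacher."""
    ch = slim_feat.shape[-1]
    return F.mse_loss(slim_feat, full_feat.detach()[..., :ch])

layer = SlimmableLinear(64, 128)
x = torch.randn(8, 64)
full = layer(x, width=1.0)        # teacher: full-width configuration
slim = layer(x, width=0.5)        # student: slimmed configuration
loss = config_invariant_kd(full, slim)
```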
Abstract: Most cross-domain unsupervised Video Anomaly Detection (VAD) works assume that at least a few task-relevant target domain training data are available for adaptation from the source to the target domain. However, this requires laborious model tuning by the end-user, who may prefer a system that works "out-of-the-box." To address such practical scenarios, we identify a novel target-domain (inference-time) VAD task where no target domain training data are available. To this end, we propose a new 'Zero-shot Cross-domain Video Anomaly Detection (zxvad)' framework that includes a future-frame prediction generative model setup. Different from prior future-frame prediction models, our model uses a novel Normalcy Classifier module to learn the features of normal event videos by learning how such features differ relative to features in pseudo-abnormal examples. A novel Untrained Convolutional Neural Network-based Anomaly Synthesis module crafts these pseudo-abnormal examples by adding foreign objects to normal video frames with no extra training cost. With our novel relative normalcy feature learning strategy, zxvad generalizes and learns to distinguish between normal and abnormal frames in a new target domain without adaptation during inference. Through evaluations on common datasets, we show that zxvad outperforms the state-of-the-art (SOTA), regardless of whether task-relevant (i.e., VAD) source training data are available or not. Lastly, zxvad also beats the SOTA methods in inference-time efficiency metrics, including model size, total parameters, GPU energy consumption, and GMACs.
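An illustrative stand-in for the pseudo-abnormal synthesis step, assuming a simple cut-and-paste of a foreign-object crop into a normal frame; the paper's module additionally passes the object through an untrained CNN, which is omitted in this sketch.

```python
import torch

def synthesize_pseudo_abnormal(frame: torch.Tensor, foreign_obj: torch.Tensor) -> torch.Tensor:
    """Create a pseudo-abnormal frame by pasting a foreign-object crop into a
    normal frame at a random location (a simplified stand-in for the paper's
    untrained-CNN-based synthesis; no training is involved in either case)."""
    _, H, W = frame.shape
    _, h, w = foreign_obj.shape
    top = torch.randint(0, H - h + 1, (1,)).item()
    left = torch.randint(0, W - w + 1, (1,)).item()
    out = frame.clone()
    out[:, top:top + h, left:left + w] = foreign_obj   # inject the foreign object
    return out

normal_frame = torch.rand(3, 240, 360)      # a frame from a normal training video
foreign = torch.rand(3, 48, 48)             # crop of an object alien to the scene
pseudo_abnormal = synthesize_pseudo_abnormal(normal_frame, foreign)
```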
Abstract: State-of-the-art generative model-based attacks against image classifiers overwhelmingly focus on single-object (i.e., single dominant object) images. Different from such settings, we tackle the more practical problem of generating adversarial perturbations using multi-object (i.e., multiple dominant objects) images, as they are representative of most real-world scenes. Our goal is to design an attack strategy that can learn from such natural scenes by leveraging the local patch differences that occur inherently in such images (e.g., the difference between a local patch on the object 'person' and one on the object 'bike' in a traffic scene). Our key idea is to misclassify an adversarial multi-object image by confusing the victim classifier on each local patch in the image. Based on this, we propose a novel generative attack (called Local Patch Difference or LPD-Attack) where a novel contrastive loss function uses the aforesaid local differences in the feature space of multi-object scenes to optimize the perturbation generator. Through various experiments across diverse victim convolutional neural networks, we show that our approach outperforms baseline generative attacks with highly transferable perturbations when evaluated under different white-box and black-box settings.
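One way the patch-level contrastive objective could look, assuming patch features extracted from a surrogate classifier and an InfoNCE-style formulation; this is a hedged sketch, not the exact LPD-Attack loss.

```python
import torch
import torch.nn.functional as F

def local_patch_contrastive_loss(clean_feat: torch.Tensor,
                                 adv_feat: torch.Tensor,
                                 tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over local patches (illustrative, not the exact LPD loss).
    clean_feat, adv_feat: (num_patches, dim) features of corresponding patches.
    The generator is trained to *reduce* agreement between an adversarial patch
    and its own clean patch, so the negative InfoNCE loss is returned."""
    clean = F.normalize(clean_feat, dim=-1)
    adv = F.normalize(adv_feat, dim=-1)
    logits = adv @ clean.t() / tau                       # (P, P) patch-to-patch similarities
    targets = torch.arange(adv.shape[0])                 # diagonal = the matching clean patch
    return -F.cross_entropy(logits, targets)             # push matching patch pairs apart

P, d = 16, 512                                           # e.g. a 4x4 grid of patches
loss = local_patch_contrastive_loss(torch.randn(P, d), torch.randn(P, d))
```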
Abstract: The majority of methods for crafting adversarial attacks have focused on scenes with a single dominant object (e.g., images from ImageNet). On the other hand, natural scenes include multiple dominant objects that are semantically related. Thus, it is crucial to explore attack strategies that look beyond learning on single-object scenes or attacking single-object victim classifiers. Because generative models craft perturbations with inherently strong transferability to unknown models, this paper presents the first approach that uses them for adversarial attacks on multi-object scenes. To represent the relationships between different objects in the input scene, we leverage the open-source pre-trained vision-language model CLIP (Contrastive Language-Image Pre-training), with the motivation of exploiting the semantics encoded in the language space along with the visual space. We call this attack approach Generative Adversarial Multi-object scene Attacks (GAMA). GAMA demonstrates the utility of the CLIP model as an attacker's tool to train formidable perturbation generators for multi-object scenes. Using the joint image-text features to train the generator, we show that GAMA can craft potent transferable perturbations in order to fool victim classifiers in various attack settings. For example, GAMA triggers ~16% more misclassification than state-of-the-art generative approaches in black-box settings where both the classifier architecture and the data distribution of the attacker are different from those of the victim. Our code will be made publicly available soon.
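A sketch of how CLIP's joint image-text features might drive the perturbation generator, using the open-source `clip` package; the specific objective below (pushing the perturbed image's embedding away from the clean image and its caption) is illustrative, not GAMA's exact loss.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

clip_model, _ = clip.load("ViT-B/32", device="cpu")  # CPU keeps weights in float32
clip_model.eval()

def gama_style_loss(clean_img: torch.Tensor, adv_img: torch.Tensor, caption: str) -> torch.Tensor:
    """Push the CLIP embedding of the perturbed image away from both the clean
    image embedding and the text embedding of the scene description
    (an illustrative objective in the spirit of GAMA, not its exact loss)."""
    with torch.no_grad():
        img_ref = F.normalize(clip_model.encode_image(clean_img), dim=-1)
        txt_ref = F.normalize(clip_model.encode_text(clip.tokenize([caption])), dim=-1)
    adv = F.normalize(clip_model.encode_image(adv_img), dim=-1)
    # maximizing distance to both references = minimizing their cosine similarities
    return (adv * img_ref).sum(-1).mean() + (adv * txt_ref).sum(-1).mean()

# the perturbation generator would be updated to minimize this loss
clean = torch.randn(1, 3, 224, 224)
adv = clean + 0.03 * torch.randn_like(clean)   # stand-in for the generator's output
loss = gama_style_loss(clean, adv, "a person riding a bike on a street")
```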
Abstract: Image enhancement approaches often assume that the noise is signal independent and approximate the degradation model as zero-mean additive Gaussian noise. However, this assumption does not hold for biomedical imaging systems, where sensor-based sources of noise are proportional to signal strengths and the noise is better represented as a Poisson process. In this work, we explore a sparsity- and dictionary-learning-based approach and present a novel self-supervised learning method for single-image denoising where the noise is approximated as a Poisson process, requiring no clean ground-truth data. Specifically, we approximate traditional iterative optimization algorithms for image denoising with a recurrent neural network that enforces sparsity with respect to the weights of the network. Since the sparse representations are based on the underlying image, the network is able to suppress the spurious components (noise) in the image patches, thereby introducing implicit regularization for the denoising task through the network structure. Experiments on two bio-imaging datasets demonstrate that our method outperforms state-of-the-art approaches in terms of PSNR and SSIM. Our qualitative results demonstrate that, in addition to higher performance on standard quantitative metrics, we are able to recover much more subtle details than the compared approaches.
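A LISTA-style sketch of unrolling iterative sparse coding into a recurrent network with shared weights and soft thresholding, which is one way to realize the described idea; the dictionary size, step size, and thresholds here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentSparseDenoiser(nn.Module):
    """Unroll ISTA-style sparse coding into a recurrent network (a LISTA-like
    sketch: sparse codes over a learned dictionary reconstruct the patch and
    suppress the noisy components)."""

    def __init__(self, patch_dim: int = 64, code_dim: int = 256, num_iters: int = 10):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(patch_dim, code_dim) * 0.1)
        self.threshold = nn.Parameter(torch.full((code_dim,), 0.1))
        self.step = nn.Parameter(torch.tensor(0.1))
        self.num_iters = num_iters

    def forward(self, noisy_patches: torch.Tensor) -> torch.Tensor:
        # noisy_patches: (batch, patch_dim) vectorized image patches
        D = self.dictionary
        z = torch.zeros(noisy_patches.shape[0], D.shape[1], device=noisy_patches.device)
        for _ in range(self.num_iters):                  # recurrent (shared-weight) updates
            residual = noisy_patches - z @ D.t()
            z = z + self.step * residual @ D             # gradient step on the data fit
            z = torch.sign(z) * F.relu(z.abs() - self.threshold)  # soft threshold -> sparsity
        return z @ D.t()                                 # reconstruct denoised patches

model = RecurrentSparseDenoiser()
denoised = model(torch.rand(32, 64))                     # 32 noisy 8x8 patches
```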
Abstract: Compared to attacks on image classification models, black-box adversarial attacks against video classification models have been largely understudied. This is likely because, with video, the temporal dimension poses significant additional challenges for gradient estimation. Query-efficient black-box attacks rely on effectively estimated gradients towards maximizing the probability of misclassifying the target video. In this work, we demonstrate that such effective gradients can be searched for by parameterizing the temporal structure of the search space with geometric transformations. Specifically, we design a novel iterative algorithm, GEOmetric TRAnsformed Perturbations (GEO-TRAP), for attacking video classification models. GEO-TRAP employs standard geometric transformation operations to reduce the search space for effective gradients to a small group of parameters that define these operations. This group of parameters describes the geometric progression of gradients, resulting in a reduced and structured search space. Our algorithm inherently leads to successful perturbations with surprisingly few queries. For example, adversarial examples generated by GEO-TRAP achieve better attack success rates with ~73.55% fewer queries compared to the state-of-the-art method for video adversarial attacks on the widely used Jester dataset. Overall, our algorithm exposes vulnerabilities of diverse video classification models and achieves new state-of-the-art results under black-box settings on two large datasets.
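A sketch of the low-dimensional parameterization at the heart of this idea: a single base perturbation warped per frame by a few affine parameters, so the black-box search estimates gradients over `theta` rather than over every pixel. The query loop is omitted, and this is not the full GEO-TRAP algorithm.

```python
import torch
import torch.nn.functional as F

def warp_perturbation(base_pert: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Produce a per-frame video perturbation by applying affine warps (defined by
    a handful of parameters) to one base perturbation frame. The black-box search
    then only estimates gradients w.r.t. `theta`, not every pixel (illustrative
    sketch of the low-dimensional parameterization)."""
    # base_pert: (C, H, W); theta: (T, 2, 3) affine matrices, one per video frame
    T = theta.shape[0]
    frames = base_pert.unsqueeze(0).expand(T, -1, -1, -1)           # (T, C, H, W)
    grid = F.affine_grid(theta, frames.shape, align_corners=False)  # per-frame sampling grid
    return F.grid_sample(frames, grid, align_corners=False)         # warped per-frame noise

num_frames = 16
base = 0.03 * torch.randn(3, 112, 112)
# identity transforms plus small random offsets: only 16 * 6 = 96 search parameters
theta = torch.eye(2, 3).repeat(num_frames, 1, 1) + 0.01 * torch.randn(num_frames, 2, 3)
video_perturbation = warp_perturbation(base, theta)                 # (16, 3, 112, 112)
```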
Abstract: Despite much recent progress in video-based person re-identification (re-ID), the current state of the art still suffers from common real-world challenges such as appearance similarity among different people, occlusions, and frame misalignment. To alleviate these problems, we propose Spatio-Temporal Representation Factorization (STRF), a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. The key innovations of STRF over prior work include explicit pathways for learning discriminative temporal and spatial features, with each component further factorized to capture complementary person-specific appearance and motion information. Specifically, temporal factorization comprises two branches, one for static features (e.g., the color of clothes) that do not change much over time, and one for dynamic features (e.g., walking patterns) that change over time. Further, spatial factorization also comprises two branches to learn both global (coarse segments) and local (finer segments) appearance features, with the local features particularly useful in cases of occlusion or spatial misalignment. Taken together, these two factorization operations result in a modular architecture for our parameter-light STRF unit, which can be plugged in between any two 3D convolutional layers, resulting in an end-to-end learning framework. We empirically show that STRF improves the performance of various existing baseline architectures while achieving new state-of-the-art results using standard person re-ID evaluation protocols on three benchmarks.
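A rough sketch of a factorized unit in the spirit of STRF, with static/dynamic temporal branches and global/local spatial branches recombined residually; the branch designs and shapes here are assumptions for illustration, not the exact STRF block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRFBlock(nn.Module):
    """A factorized spatio-temporal unit in the spirit of STRF: temporal static
    (time-averaged) and dynamic (frame-difference) branches plus spatial global
    and local (stripe-pooled) branches, recombined as a residual. Illustrative only."""

    def __init__(self, channels: int, num_parts: int = 4):
        super().__init__()
        self.static = nn.Conv3d(channels, channels, kernel_size=1)     # appearance over time
        self.dynamic = nn.Conv3d(channels, channels, kernel_size=1)    # motion cues
        self.global_sp = nn.Conv3d(channels, channels, kernel_size=1)  # coarse appearance
        self.local_sp = nn.Conv3d(channels, channels, kernel_size=1)   # part-level appearance
        self.num_parts = num_parts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T, H, W) features from a 3D CNN backbone
        b, c, t, h, w = x.shape
        static = self.static(x.mean(dim=2, keepdim=True))
        dynamic = F.pad(self.dynamic(x[:, :, 1:] - x[:, :, :-1]), (0, 0, 0, 0, 1, 0))
        glob = self.global_sp(x.mean(dim=(3, 4), keepdim=True))
        stripes = x.reshape(b, c, t, self.num_parts, h // self.num_parts, w).mean(dim=4)
        local = F.interpolate(self.local_sp(stripes), size=(t, h, w))
        return x + static + dynamic + glob + local   # residual fusion of all factors

block = STRFBlock(channels=64)
out = block(torch.randn(2, 64, 8, 8, 8))   # drop in between two 3D conv layers
```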