Shruti Vyas

LR0.FM: Low-Resolution Zero-shot Classification Benchmark For Foundation Models

Feb 07, 2025

Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat

Abstract: Visual-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in the input adversely impacts the model's initial layers more than the deeper layers. We use these insights to introduce a simple strategy, LR-TK0, that enhances the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low resolution across several datasets and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM

* Accepted to ICLR 2025
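
As a rough illustration of what a low-resolution (pixelated) input looks like, the sketch below downsamples a small grayscale image by block averaging and then upsamples it back to the original size by repeating values. The function name `pixelate` and the choice of block averaging are assumptions made for this example; the degradation pipeline actually used by the benchmark is in the linked repository and may differ.

```python
def pixelate(image, factor):
    """Downsample a 2D grayscale image by block averaging, then upsample
    it back to the original size by repeating each averaged value.

    `image` is a list of equal-length rows. Height and width are assumed
    to be divisible by `factor` (a simplification for this sketch).
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by factor")

    # Average each factor x factor block into one low-resolution pixel.
    low_res = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        low_res.append(row)

    # Nearest-neighbour upsampling: repeat each value `factor` times per axis.
    return [[low_res[y // factor][x // factor] for x in range(width)]
            for y in range(height)]
```

Calling `pixelate(image, 2)` on a 4x4 image yields a 4x4 image made of four flat 2x2 blocks, which is the kind of detail loss a zero-shot classifier would face at low resolution.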

PV-S3: Advancing Automatic Photovoltaic Defect Detection using Semi-Supervised Semantic Segmentation of Electroluminescence Images

Apr 21, 2024

Abhishek Jha, Yogesh Rawat, Shruti Vyas

Abstract: Photovoltaic (PV) systems allow us to tap into abundant solar energy; however, they require regular maintenance for high efficiency and to prevent degradation. Traditional manual health checks using Electroluminescence (EL) imaging are expensive and logistically challenging, making automated defect detection essential. Current automation approaches require extensive manual expert labeling, which is time-consuming, expensive, and prone to errors. We propose PV-S3 (Photovoltaic-Semi Supervised Segmentation), a semi-supervised learning approach for semantic segmentation of defects in EL images that reduces reliance on extensive labeling. PV-S3 is a deep learning model trained using a few labeled images along with numerous unlabeled images. We introduce a novel Semi Cross-Entropy loss function to train PV-S3, which addresses the challenges specific to automated PV defect detection, such as diverse defect types and class imbalance. We evaluate PV-S3 on multiple datasets and demonstrate its effectiveness and adaptability. With merely 20% labeled samples, we achieve an absolute improvement of 9.7% in IoU, 29.9% in Precision, 12.75% in Recall, and 20.42% in F1-Score over the prior state-of-the-art supervised method (which uses 100% labeled samples) on the UCF-EL dataset (the largest dataset available for semantic segmentation of EL images), showing improved performance while reducing annotation costs by 80%.
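
For readers unfamiliar with the four metrics reported above, the following is a minimal sketch of their standard definitions for a single binary defect mask. The function name `segmentation_metrics` and the flat 0/1 mask representation are assumptions for illustration; how the paper aggregates these scores across defect classes and images is not stated in the abstract.

```python
def segmentation_metrics(pred, target):
    """Compute IoU, precision, recall and F1 for two binary masks.

    `pred` and `target` are equal-length flat sequences of 0/1 labels,
    where 1 marks a defect pixel.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)

    def safe_div(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "iou": safe_div(tp, tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }
```

An "absolute improvement" in these terms is a difference in percentage points between two such scores, not a relative gain.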

Semi-supervised Active Learning for Video Action Detection

Dec 12, 2023

Aayush Singh, Aayush J Rana, Akash Kumar, Shruti Vyas, Yogesh Singh Rawat

Abstract: In this work, we focus on label-efficient learning for video action detection. We develop a novel semi-supervised active learning approach that utilizes both labeled and unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both informative sample selection in active learning and pseudo-label generation in semi-supervised learning (SSL). First, we propose NoiseAug, a simple augmentation strategy that effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering that enables effective utilization of pseudo labels for SSL in video action detection by emphasizing the relevant activity region within a video. We evaluate the proposed approach on three different benchmark datasets: UCF101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches on both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos.

* AAAI'24 Main Conference

GAMa: Cross-view Video Geo-localization

Jul 06, 2022

Shruti Vyas, Chen Chen, Mubarak Shah

Abstract: The existing work in cross-view geo-localization is based on images, where a ground panorama is matched to an aerial image. In this work, we focus on ground videos instead of images, which provide additional contextual cues that are important for this task. There are no existing datasets for this problem; therefore, we propose the GAMa dataset, a large-scale dataset with ground videos and corresponding aerial images. We also propose a novel approach to solve this problem. At the clip level, a short video clip is matched with the corresponding aerial image, and this match is later used to obtain video-level geo-localization of a long video. Moreover, we propose a hierarchical approach to further improve the clip-level geo-localization. The dataset is challenging, with unaligned views and a limited field of view, and our proposed method achieves a Top-1 recall rate of 19.4% and 45.1% @1.0mile. Code and dataset are available at the following link: https://github.com/svyas23/GAMa.

* ECCV 2022
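
The Top-1 recall figure quoted in the abstract is a retrieval metric. The sketch below shows one common way to compute top-k recall from a query-by-gallery similarity matrix; the function name `top_k_recall` and the input layout are assumptions for this example, and it does not model the distance-thresholded "@1.0mile" variant, whose exact definition is not given in the abstract.

```python
def top_k_recall(similarity, ground_truth, k=1):
    """Fraction of queries whose correct gallery item is ranked in the top k.

    `similarity[i][j]` is the score between query i (e.g. a ground clip)
    and gallery item j (e.g. an aerial image); higher means more similar.
    `ground_truth[i]` is the gallery index that matches query i.
    """
    if not ground_truth:
        raise ValueError("at least one query is required")

    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        # Rank gallery indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)
```

With `k=1`, a query counts as a hit only when its highest-scoring aerial image is the correct one.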

Multi-modal Robustness Analysis Against Language and Visual Perturbations

Jul 06, 2022

Madeline C. Schiappa, Shruti Vyas, Hamid Palangi, Yogesh S. Rawat, Vibhav Vineet

Abstract: Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations, focusing on video and language. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) the studied models are more robust when text is perturbed than when video is perturbed; 2) the transformer text encoder is more robust on non-semantic-changing text perturbations and visual perturbations compared to word embedding approaches; and 3) using two-branch encoders in isolation is typically more robust than when architectures use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.

* 29 pages, 21 figures. The project webpage is located at https://maddy12.github.io/MultiModalVideoRobustness/

Large-scale Robustness Analysis of Video Action Recognition Models

Jul 04, 2022

Madeline C. Schiappa, Naman Biyani, Shruti Vyas, Hamid Palangi, Vibhav Vineet, Yogesh Rawat

Abstract: We have seen great progress in video action recognition in recent years. There are several models based on convolutional neural networks (CNNs), along with some recent transformer-based approaches, which provide state-of-the-art performance on existing benchmark datasets. However, large-scale robustness, a critical aspect for real-world applications, has not been studied for these models. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We mainly focus on robustness against distribution shifts due to real-world perturbations instead of adversarial perturbations. We propose four different benchmark datasets, HMDB-51P, UCF-101P, Kinetics-400P, and SSv2P, and study the robustness of six different state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) transformer-based models are consistently more robust against most of the perturbations when compared with CNN-based models; 2) pretraining helps transformer-based models to be more robust to different perturbations than CNN-based models; and 3) all of the studied models are robust to temporal perturbation on the Kinetics dataset, but not on SSv2, which suggests temporal information is much more important for action label prediction on the SSv2 dataset than on the Kinetics dataset. We hope that this study will serve as a benchmark for future research in robust video action recognition. More details about the project are available at https://rose-ar.github.io/.

* 26 pages, 21 figures

Video Action Detection: Analysing Limitations and Challenges

Apr 17, 2022

Rajat Modi, Aayush Jung Rana, Akash Kumar, Praveen Tirupattur, Shruti Vyas, Yogesh Singh Rawat, Mubarak Shah

Abstract: Beyond possessing large enough size to feed data-hungry machines (e.g., transformers), what attributes measure the quality of a dataset? Assuming that definitions of such attributes do exist, how do we quantify their relative presence? Our work attempts to explore these questions for video action detection. The task aims to spatio-temporally localize an actor and assign a relevant action class. We first analyze the existing datasets on video action detection and discuss their limitations. Next, we propose a new dataset, Multi Actor Multi Action (MAMA), which overcomes these limitations and is more suitable for real-world applications. In addition, we perform a bias study which analyzes a key property differentiating videos from static images: the temporal aspect. This reveals whether the actions in these datasets really need the motion information of an actor, or whether the occurrence of an action can be predicted even by looking at a single frame. Finally, we investigate the widely held assumptions on the importance of temporal ordering: is temporal ordering important for detecting these actions? Such extreme experiments show the existence of biases which have managed to creep into existing methods in spite of careful modeling.

* CVPRW'22

LARNet: Latent Action Representation for Human Action Synthesis

Oct 27, 2021

Naman Biyani, Aayush J Rana, Shruti Vyas, Yogesh S Rawat

Abstract: We present LARNet, a novel end-to-end approach for generating human action videos. Joint generative modeling of appearance and dynamics to synthesize a video is very challenging, and therefore recent works in video synthesis have proposed to decompose these two factors. However, these methods require a driving video to model the video dynamics. In this work, we propose a generative approach instead, which explicitly learns action dynamics in latent space, avoiding the need for a driving video during inference. The generated action dynamics are integrated with the appearance using a recurrent hierarchical structure which induces motion at different scales to focus on both coarse and fine-level action details. In addition, we propose a novel mix-adversarial loss function which aims at improving the temporal coherency of synthesized videos. We evaluate the proposed approach on four real-world human action datasets, demonstrating its effectiveness in generating human actions. Code is available at https://github.com/aayushjr/larnet.

* British Machine Vision Conference (BMVC) 2021

Pose-guided Generative Adversarial Net for Novel View Action Synthesis

Oct 15, 2021

Xianhang Li, Junhao Zhang, Kunchang Li, Shruti Vyas, Yogesh S Rawat

Abstract: We focus on the problem of novel-view human action synthesis. Given an action video, the goal is to generate the same action from an unseen viewpoint. Naturally, novel-view video synthesis is more challenging than image synthesis. It requires the synthesis of a sequence of realistic frames with temporal coherency. Besides, transferring the different actions to a novel target view requires awareness of action category and viewpoint change simultaneously. To address these challenges, we propose a novel framework named Pose-guided Action Separable Generative Adversarial Net (PAS-GAN), which utilizes pose to alleviate the difficulty of this task. First, we propose a recurrent pose-transformation module which transforms actions from the source view to the target view and generates a novel-view pose sequence in 2D coordinate space. Second, a well-transformed pose sequence enables us to separate the action and background in the target view. We employ a novel local-global spatial transformation module to effectively generate sequential video features in the target view using these action and background features. Finally, the generated video features are used to synthesize human action with the help of a 3D decoder. Moreover, to focus on dynamic action in the video, we propose a novel multi-scale action-separable loss which further improves the video quality. We conduct extensive experiments on two large-scale multi-view human action datasets, NTU-RGBD and PKU-MMD, demonstrating the effectiveness of PAS-GAN, which outperforms existing approaches.

* Accepted by WACV2022

NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels

Oct 13, 2021

Mohit Sharma, Raj Patra, Harshal Desai, Shruti Vyas, Yogesh Rawat, Rajiv Ratn Shah

Abstract: Deep learning has shown remarkable progress in a wide range of problems. However, efficient training of such models requires large-scale datasets, and getting annotations for such datasets can be challenging and costly. In this work, we explore the use of freely available user-generated labels from web videos for video understanding. We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information. We utilize the collected dataset for action classification and demonstrate its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51. We study different loss functions and two pretraining strategies, simple and self-supervised learning. We also show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets. We present this as a benchmark dataset in noisy learning for video understanding. The dataset, code, and trained models will be publicly available for future research.

* Accepted at ACM Multimedia Asia 2021
