Malitha Gunawardhana

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Feb 08, 2025

Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan

Abstract: Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective. This may lead the model to prioritize general spatiotemporal patterns while overlooking nuanced semantic attributes, such as the specific interactions or sequences that define actions and that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of certain actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets for videos and images, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using video and corresponding sampled frame images together. To this end, we propose CrossVideoMAE, an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved by jointly embedding features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate the effectiveness of our approach.

How Effective are Self-Supervised Models for Contact Identification in Videos

Aug 01, 2024

Malitha Gunawardhana, Limalka Sadith, Liel David, Daniel Harari, Muhammad Haris Khan

Abstract: The exploration of video content via Self-Supervised Learning (SSL) models has unveiled a dynamic field of study, emphasizing both the complex challenges and unique opportunities inherent in this area. Despite the growing body of research, the ability of SSL models to detect physical contacts in videos remains largely unexplored, particularly the effectiveness of methods such as downstream supervision with linear probing or full fine-tuning. This work aims to bridge this gap by employing eight different convolutional neural network (CNN)-based video SSL models to identify instances of physical contact within video sequences. The Something-Something v2 (SSv2) and Epic-Kitchen (EK-100) datasets were chosen for evaluating these approaches because the models show promising results on UCF101 and HMDB51 but have seen limited prior assessment on SSv2 and EK-100. Additionally, these datasets feature diverse environments and scenarios, essential for testing the robustness and accuracy of video-based models. This approach not only examines the effectiveness of each model in recognizing physical contacts but also explores performance on the action recognition downstream task. In doing so, it contributes valuable insights into the adaptability of SSL models in interpreting complex, dynamic visual information.

* 15 pages, 6 figures

Dynamic Position Transformation and Boundary Refinement Network for Left Atrial Segmentation

Jul 07, 2024

Fangqiang Xu, Wenxuan Tu, Fan Feng, Malitha Gunawardhana, Jiayuan Yang, Yun Gu, Jichao Zhao

Abstract: Left atrial (LA) segmentation is a crucial technique for irregular heartbeat (i.e., atrial fibrillation) diagnosis. Most current methods for LA segmentation strictly assume that the input data is acquired using object-oriented center cropping, but this assumption may not always hold in practice due to the high cost of manual object annotation. Random cropping is a straightforward data pre-processing approach. However, it 1) introduces significant irregularities and incompleteness in the input data and 2) disrupts the coherence and continuity of object boundary regions. To tackle these issues, we propose a novel Dynamic Position transformation and Boundary refinement Network (DPBNet). The core idea is to dynamically adjust the relative position of irregular targets to construct their contextual relationships and prioritize difficult boundary pixels to enhance foreground-background distinction. Specifically, we design a shuffle-then-reorder attention module to adjust the position of disrupted objects in the latent space using dynamic generation ratios, such that the vital dependencies among these randomly cropped targets can be captured and preserved. Moreover, to improve the accuracy of boundary localization, we introduce a dual fine-grained boundary loss with scenario-adaptive weights to handle the ambiguity of the dual boundary at a fine-grained level, promoting the clarity and continuity of the obtained results. Extensive experimental results on a benchmark dataset demonstrate that DPBNet consistently outperforms existing state-of-the-art methods.

* MICCAI 2024 conference

Towards Generalizing to Unseen Domains with Few Labels

Mar 19, 2024

Chamuditha Jayanga Galappaththige, Sanoojan Baliah, Malitha Gunawardhana, Muhammad Haris Khan

Abstract: We address the challenge of semi-supervised domain generalization (SSDG). Specifically, our aim is to obtain a model that learns domain-generalizable features by leveraging a limited subset of labeled data alongside a substantially larger pool of unlabeled data. Existing domain generalization (DG) methods, which are unable to exploit unlabeled data, perform poorly compared to semi-supervised learning (SSL) methods under the SSDG setting. Nevertheless, SSL methods have considerable room for performance improvement when compared to fully-supervised DG training. To tackle this underexplored, yet highly practical problem of SSDG, we make the following core contributions. First, we propose a feature-based conformity technique that matches the posterior distributions from the feature space with the pseudo-label from the model's output space. Second, we develop a semantics alignment loss to learn semantically-compatible representations by regularizing the semantic structure in the feature space. Our method is plug-and-play and can be readily integrated with different SSL-based SSDG baselines without introducing any additional parameters. Extensive experimental results across five challenging DG benchmarks with four strong SSL baselines suggest that our method provides consistent and notable gains in two different SSDG settings.

* Accepted at CVPR 2024

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

Mar 05, 2024

Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, Muhammad Haris Khan

Abstract: In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We call our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision.

* 8 pages, 6 figures, (supplementary material: 8 pages, 5 figures), accepted to CVPR 2024
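
The abstract above describes procedural knowledge "structured as a directed weighted graph" and extracted from training plans. As a rough illustration of that idea only, the sketch below counts step-to-step transitions in a few made-up plans and normalizes them into probabilities. The function name `build_procedure_graph`, the step names, and the normalization scheme are all assumptions for illustration; the listing does not say how KEPP actually builds or uses its graph.

```python
from collections import Counter, defaultdict


def build_procedure_graph(plans):
    """Return {step: {next_step: probability}} from observed step sequences.

    Assumption: edge weights are transition counts normalized per source
    step. This is one plausible reading of "probabilistic procedural
    knowledge graph", not the paper's confirmed construction.
    """
    counts = defaultdict(Counter)
    for plan in plans:
        # Each consecutive pair of steps contributes one directed edge.
        for current_step, next_step in zip(plan, plan[1:]):
            counts[current_step][next_step] += 1

    graph = {}
    for step, successors in counts.items():
        total = sum(successors.values())
        graph[step] = {nxt: n / total for nxt, n in successors.items()}
    return graph


# Hypothetical training plans (step names are invented, not from any dataset).
training_plans = [
    ["crack egg", "whisk", "pour", "flip"],
    ["crack egg", "whisk", "add milk", "pour", "flip"],
    ["crack egg", "add milk", "whisk", "pour", "flip"],
]
graph = build_procedure_graph(training_plans)

for step, successors in graph.items():
    print(step, "->", successors)
```

Steps that only ever appear last in a plan (here, "flip") have no outgoing edges and therefore no entry in the returned dictionary.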

Performance of Machine Learning Classification in Mammography Images using BI-RADS

Nov 14, 2023

Malitha Gunawardhana, Norbert Zolek

Abstract: This research aims to investigate the classification accuracy of various state-of-the-art image classification models across different categories of breast ultrasound images, as defined by the Breast Imaging Reporting and Data System (BI-RADS). To achieve this, we have utilized a comprehensively assembled dataset of 2,945 mammographic images sourced from 1,540 patients. In order to conduct a thorough analysis, we employed six advanced classification architectures, including VGG19, ResNet50, GoogLeNet, ConvNeXt, EfficientNet, and Vision Transformers (ViT), instead of traditional machine learning models. We evaluate models in three different settings: full fine-tuning, linear evaluation and training from scratch. Our findings demonstrate the effectiveness and capability of our Computer-Aided Diagnosis (CAD) system, with a remarkable accuracy of 76.39% and an F1 score of 67.94% in the full fine-tuning setting. Our findings indicate the potential for enhanced diagnostic accuracy in the field of breast imaging, providing a solid foundation for future endeavors aiming to improve the precision and reliability of CAD systems in medical imaging.

Multiclass Confidence and Localization Calibration for Object Detection

Jun 14, 2023

Bimsara Pathiraja, Malitha Gunawardhana, Muhammad Haris Khan

Abstract: Although deep neural networks (DNNs) achieve high predictive accuracy across many challenging computer vision problems, recent studies suggest that they tend to make overconfident predictions, rendering them poorly calibrated. Most of the existing attempts at improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, few to no attempts have been made to study the calibration of object detection methods, which occupy a pivotal space in vision-based, security-sensitive, and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. Results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL.

* Project page - https://bimsarapathiraja.github.io/mccl-project-page/
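
The abstract above reports results in terms of "calibration error" without defining it. For readers unfamiliar with the term, the sketch below computes the standard expected calibration error (ECE) for a plain classifier: predictions are binned by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. This is a generic illustration, not the MCCL method, and it ignores the box-localization aspect that detection-specific metrics account for. The function name, bin count, and sample values are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # Empty bins contribute nothing.
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical overconfident model: 95% confidence, but only half correct.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
print(f"ECE: {overconfident:.2f}")
```

A lower value indicates that confidence scores track empirical accuracy more closely; an overconfident model like the hypothetical one above yields a large gap.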
