Deepu Rajan

Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

Dec 12, 2024

Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See

Abstract: Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only using cross-modal pairing information as self-supervision signals. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of selected multimodal data pairs in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.

* 11 pages, ACMMM Asia 2024, Oral Presentation
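
The modality gap described in the abstract above, the distance between the centroids of representations from different modalities, can be illustrated with a small sketch. The function names and toy embeddings are hypothetical and not taken from the paper, and Euclidean distance is an assumed choice of metric.

```python
import math

def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def modality_gap(emb_a, emb_b):
    """Distance between the centroids of two embedding sets.

    Euclidean distance is an assumption made for this illustration.
    """
    return math.dist(centroid(emb_a), centroid(emb_b))

# Toy 2-D embeddings for two modalities (illustrative values only).
image_emb = [[1.0, 0.0], [3.0, 0.0]]
text_emb = [[2.0, 3.0], [2.0, 5.0]]
gap = modality_gap(image_emb, text_emb)
print(f"modality gap: {gap:.3f}")
```

A large gap relative to the spread within each modality suggests that cross-modal distances are not directly comparable to uni-modal ones.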

Situational Scene Graph for Structured Human-centric Situation Understanding

Oct 30, 2024

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

Abstract: Graph-based representation has been widely used in modelling spatio-temporal relationships in video understanding. Although effective, existing graph-based approaches focus on capturing the human-object relationships while ignoring fine-grained semantic properties of the action components. These semantic properties are crucial for understanding the current situation, such as where the action takes place, what tools are used, and the functional properties of the objects. In this work, we propose a graph-based representation called Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties. The semantic details are represented as predefined roles and values inspired by the situation frame, which was originally designed to represent a single action. Based on our proposed representation, we introduce the task of situational scene graph generation and propose a multi-stage pipeline, Interactive and Complementary Network (InComNet), to address the task. Given that the existing datasets are not applicable to the task, we further introduce an SSG dataset whose annotations consist of semantic role-value frames for humans, objects, and verb predicates of human-object relations. Finally, we demonstrate the effectiveness of our proposed SSG representation by testing on different downstream tasks. Experimental results show that the unified representation benefits not only predicate classification and semantic role-value classification, but also reasoning tasks on human-centric situation understanding. We will release the code and the dataset soon.

* Accepted for WACV 2025

A Unified Framework for Guiding Generative AI with Wireless Perception in Resource Constrained Mobile Edge Networks

Sep 04, 2023

Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Deepu Rajan, Shiwen Mao, Xuemin Shen

Abstract: With the significant advancements in artificial intelligence (AI) technologies and powerful computational capabilities, generative AI (GAI) has become a pivotal digital content generation technique for offering superior digital services. However, directing GAI towards desired outputs still suffers from the inherent instability of the AI model. In this paper, we design a novel framework that utilizes wireless perception to guide GAI (WiPe-GAI) for providing digital content generation service, i.e., AI-generated content (AIGC), in resource-constrained mobile edge networks. Specifically, we first propose a new sequential multi-scale perception (SMSP) algorithm to predict the user skeleton based on the channel state information (CSI) extracted from wireless signals. This prediction then guides GAI to provide users with AIGC, such as virtual character generation. To ensure the efficient operation of the proposed framework in resource-constrained networks, we further design a pricing-based incentive mechanism and introduce a diffusion-model-based approach to generate an optimal pricing strategy for the service provisioning. The strategy maximizes the user's utility while enhancing the participation of the virtual service provider (VSP) in AIGC provision. The experimental results demonstrate the effectiveness of the designed framework in terms of skeleton prediction and optimal pricing strategy generation compared with other existing solutions.

Towards Balanced Active Learning for Multimodal Classification

Jun 14, 2023

Meng Shen, Yizheng Huang, Jianxiong Yin, Heqing Zou, Deepu Rajan, Simon See

Abstract: Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, we propose a novel approach that achieves fairer data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification.

* 11 pages
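
For readers unfamiliar with active learning, the general idea of selecting the samples most likely to improve the model can be sketched with entropy-based uncertainty sampling. This is a generic baseline strategy chosen for illustration; it is not the balanced, gradient-embedding method proposed in the paper, and all names and values below are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, budget):
    """Return indices of the `budget` pool samples with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]

# Predicted class probabilities for three unlabeled samples (toy values).
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
chosen = select_most_uncertain(pool, budget=2)
print(chosen)
```

In a multimodal setting, a strategy like this scores each sample from the fused prediction only, which is one way the selection can end up driven by the dominant modality.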

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

May 16, 2023

Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng

Abstract: Multimodal learning aims to imitate how human beings acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, and suffer from sensor noise, all of which reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal prediction. Specifically, we first capture task-related unimodal representations and the unimodal predictions from the introduced unimodal prediction task. Then the unimodal representations are aligned with the more effective one by the designed multimodal contrastive method under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks, UPMC-Food-101 and N24News, show that our proposed Unimodality-Supervised MultiModal Contrastive (UniS-MMC) learning method outperforms current state-of-the-art multimodal methods. The detailed ablation study and analysis further demonstrate the advantage of our proposed method.

* ACL 2023 Findings

Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

Mar 29, 2022

Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng

Abstract: Speech Emotion Recognition (SER) aims to help machines understand human subjective emotion from audio information alone. However, extracting and utilizing comprehensive in-depth audio information is still a challenging task. In this paper, we propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module. We first extract multi-level acoustic information, including MFCC, spectrogram, and the embedded high-level acoustic information, with CNN, BiLSTM, and wav2vec2, respectively. Then these extracted features are treated as multimodal inputs and fused by the proposed co-attention mechanism. Experiments are carried out on the IEMOCAP dataset, and our model achieves competitive performance with two different speaker-independent cross-validation strategies. Our code is available on GitHub.

* Accepted by ICASSP 2022

Are object detection assessment criteria ready for maritime computer vision?

Sep 12, 2018

Dilip K. Prasad, Deepu Rajan, Chai Quek

Abstract: Maritime vessels equipped with visible and infrared cameras can complement other conventional sensors for object detection. However, the application of computer vision techniques in the maritime domain has received attention only recently. The maritime environment presents its own unique requirements and challenges. Assessing the quality of detections is a fundamental need in computer vision. However, the conventional assessment metrics suitable for general object detection are deficient in the maritime setting. Thus, a large body of related work in computer vision appears inapplicable to the maritime setting at first sight. We discuss the problem of defining assessment metrics suitable for maritime computer vision. We consider new bottom edge proximity metrics as assessment metrics for maritime computer vision. These metrics indicate that existing computer vision approaches are indeed promising for maritime computer vision and can play a foundational role in this emerging field.
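
As a point of reference, one widely used conventional assessment metric for object detection is intersection over union (IoU). The abstract does not name the conventional metrics it examines, so the choice of IoU here is an assumption, and the bottom edge proximity metrics themselves are not reproduced. The box format and values are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# A detection shifted sideways by half the width of the ground-truth box.
ground_truth = (0.0, 0.0, 4.0, 2.0)
detection = (2.0, 0.0, 6.0, 2.0)
score = iou(ground_truth, detection)
print(f"IoU: {score:.3f}")
```

Because IoU depends only on area overlap, a modest shift on a small box lowers the score sharply even when the detection sits at the right position on the water surface.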

Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection

Aug 14, 2018

Hisham Cholakkal, Jubin Johnson, Deepu Rajan

Abstract: Top-down saliency models produce a probability map that peaks at target locations specified by a task or goal such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence or absence of an object in an image. First, the probabilistic contribution of each image region to the confidence of a CNN-based image classifier is computed through a backtracking strategy to produce top-down saliency. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we select the saliency map best suited to the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate feature saliency. This is integrated with the combined saliency and further refined through multi-scale superpixel averaging of the saliency map. We evaluate the performance of the proposed weakly supervised top-down saliency and achieve performance comparable to fully supervised approaches. Experiments are carried out on seven challenging datasets, and quantitative results are compared with 40 closely related approaches across 4 different applications.

* H. Cholakkal, J. Johnson, D. Rajan, "Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection", in IEEE Transactions on Image Processing, August 2018
* 14 pages, 7 figures

L1-regularized Reconstruction Error as Alpha Matte

Feb 09, 2017

Jubin Johnson, Hisham Cholakkal, Deepu Rajan

Abstract: Sampling-based alpha matting methods have traditionally followed the compositing equation to estimate the alpha value at a pixel from a pair of foreground (F) and background (B) samples. The (F,B) pair that produces the least reconstruction error is selected, followed by alpha estimation. The significance of that residual error has been left unexamined. In this letter, we propose a video matting algorithm that uses the L1-regularized reconstruction error of F and B samples as a measure of the alpha matte. A multi-frame non-local means framework using coherency sensitive hashing is utilized to ensure temporal coherency in the video mattes. Qualitative and quantitative evaluations on a dataset designed exclusively for video matting demonstrate the effectiveness of the proposed matting algorithm.

* 5 pages, 5 figures, Accepted in IEEE Signal Processing Letters
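
The traditional sampling-based step that the abstract refers to can be sketched as follows. It uses the standard compositing equation I = alpha * F + (1 - alpha) * B, estimates alpha by projecting the pixel colour onto the line between the background and foreground samples, and then measures the reconstruction error. This is only the conventional baseline under assumed toy colours; the L1-regularized formulation proposed in the letter is not reproduced, and the function names are hypothetical.

```python
import math

def estimate_alpha(pixel, fg, bg):
    """Project the pixel colour onto the line from bg to fg and clamp to [0, 1]."""
    fg_minus_bg = [f - b for f, b in zip(fg, bg)]
    denom = sum(d * d for d in fg_minus_bg)
    if denom == 0:
        return 0.0  # Identical samples carry no information about alpha.
    numer = sum((p - b) * d for p, b, d in zip(pixel, bg, fg_minus_bg))
    return min(1.0, max(0.0, numer / denom))

def reconstruction_error(pixel, fg, bg, alpha):
    """Euclidean distance between the pixel and its composite alpha*F + (1-alpha)*B."""
    composite = [alpha * f + (1 - alpha) * b for f, b in zip(fg, bg)]
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(pixel, composite)))

# Toy RGB colours in [0, 1]: red foreground, blue background, mixed pixel.
fg_sample = (1.0, 0.0, 0.0)
bg_sample = (0.0, 0.0, 1.0)
observed = (0.25, 0.0, 0.75)
alpha = estimate_alpha(observed, fg_sample, bg_sample)
error = reconstruction_error(observed, fg_sample, bg_sample, alpha)
print(f"alpha={alpha:.2f}, error={error:.4f}")
```

In a sampling-based method, this computation would be repeated over many candidate (F, B) pairs and the pair with the smallest error retained.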

A Classifier-guided Approach for Top-down Salient Object Detection

Apr 22, 2016

Hisham Cholakkal, Jubin Johnson, Deepu Rajan

Abstract: We propose a framework for top-down salient object detection that incorporates a tightly coupled image classification module. The classifier is trained on novel category-aware sparse codes computed on object dictionaries used for saliency modeling. A misclassification indicates that the corresponding saliency model is inaccurate. Hence, the classifier selects images for which the saliency models need to be updated. The category-aware sparse coding produces better image classification accuracy than conventional sparse coding, at reduced computational complexity. A saliency-weighted max-pooling is proposed to improve image classification, which is further used to refine the saliency maps. Experimental results on the Graz-02 and PASCAL VOC-07 datasets demonstrate the effectiveness of salient object detection. Although the role of the classifier is to support salient object detection, we evaluate its performance in image classification and also illustrate the utility of thresholded saliency maps for image segmentation.

* To appear in Signal Processing: Image Communication, Elsevier. Available online from April 2016
