Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Saumik Bhattacharya

F-ANcGAN: An Attention-Enhanced Cycle Consistent Generative Adversarial Architecture for Synthetic Image Generation of Nanoparticles

May 23, 2025

Varun Ajith, Anindya Pal, Saumik Bhattacharya, Sayantari Ghosh

Abstract:Nanomaterial research is becoming a vital area for energy, medicine, and materials science, and accurate analysis of the nanoparticle topology is essential to determine their properties. Unfortunately, the lack of high-quality annotated datasets drastically hinders the creation of strong segmentation models for nanoscale imaging. To alleviate this problem, we introduce F-ANcGAN, an attention-enhanced cycle consistent generative adversarial system that can be trained using a limited number of data samples and generates realistic scanning electron microscopy (SEM) images directly from segmentation maps. Our model uses a Style U-Net generator and a U-Net segmentation network equipped with self-attention to capture structural relationships and applies augmentation methods to increase the variety of the dataset. The architecture reached a raw FID score of 17.65 for TiO$_2$ dataset generation, with a further reduction in FID score to nearly 10.39 by using efficient post-processing techniques. By facilitating scalable high-fidelity synthetic dataset generation, our approach can improve the effectiveness of downstream segmentation task training, overcoming severe data shortage issues in nanoparticle analysis, thus extending its applications to resource-limited fields.

* 11 pages, 9 figures, 2 tables, conference paper

Via

Access Paper or Ask Questions

Federated Self-Supervised Learning for One-Shot Cross-Modal and Cross-Imaging Technique Segmentation

Mar 30, 2025

Siladittya Manna, Suresh Das, Sayantari Ghosh, Saumik Bhattacharya

Abstract:Decentralized federated learning enables learning of data representations from multiple sources without compromising the privacy of the clients. In applications like medical image segmentation, where obtaining a large annotated dataset from a single source is a distressing problem, federated self-supervised learning can provide some solace. In this work, we push the limits further by exploring a federated self-supervised one-shot segmentation task representing a more data-scarce scenario. We adopt a pre-existing self-supervised few-shot segmentation framework CoWPro and adapt it to the federated learning scenario. To the best of our knowledge, this work is the first to attempt a self-supervised few-shot segmentation task in the federated learning domain. Moreover, we consider the clients to be constituted of data from different modalities and imaging techniques like MR or CT, which makes the problem even harder. Additionally, we reinforce and improve the baseline CoWPro method using a fused dice loss which shows considerable improvement in performance over the baseline CoWPro. Finally, we evaluate this novel framework on a completely unseen held-out part of the local client dataset. We observe that the proposed framework can achieve performance at par or better than the FedAvg version of the CoWPro framework on the held-out validation dataset.

Via

Access Paper or Ask Questions

SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images

Mar 04, 2025

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

Abstract:Salient object detection (SOD) in RGB-D images is an essential task in computer vision, enabling applications in scene understanding, robotics, and augmented reality. However, existing methods struggle to capture global dependency across modalities, lack comprehensive saliency priors from both RGB and depth data, and are ineffective in handling low-quality depth maps. To address these challenges, we propose SSNet, a saliency-prior and state space model (SSM)-based network for the RGB-D SOD task. Unlike existing convolution- or transformer-based approaches, SSNet introduces an SSM-based multi-modal multi-scale decoder module to efficiently capture both intra- and inter-modal global dependency with linear complexity. Specifically, we propose a cross-modal selective scan SSM (CM-S6) mechanism, which effectively captures global dependency between different modalities. Furthermore, we introduce a saliency enhancement module (SEM) that integrates three saliency priors with deep features to refine feature representation and improve the localization of salient objects. To further address the issue of low-quality depth maps, we propose an adaptive contrast enhancement technique that dynamically refines depth maps, making them more suitable for the RGB-D SOD task. Extensive quantitative and qualitative experiments on seven benchmark datasets demonstrate that SSNet outperforms state-of-the-art methods.

Via

Access Paper or Ask Questions

d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Feb 19, 2025

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Abstract:Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.

* Accepted in The International Conference on Pattern Recognition (ICPR) 2024

Via

Access Paper or Ask Questions

Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Feb 19, 2025

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Abstract:Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations make the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.

* 11 pages

Via

Access Paper or Ask Questions

l0-Regularized Sparse Coding-based Interpretable Network for Multi-Modal Image Fusion

Nov 07, 2024

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

Abstract:Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an l0-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, for solving the l0-regularized CSC problem, we develop an algorithm unrolling-based l0-regularized sparse coding (LZSC) block. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block and then these features are combined to generate the final fused image. Additionally, we propose an l0-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet's training. Extensive experiments show that FNet achieves high-quality fusion results across five different MMIF tasks. Furthermore, we show that FNet enhances downstream object detection in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrates the good interpretability of our network.

Via

Access Paper or Ask Questions

Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Oct 02, 2024

Arkadip Maitra, Shree Mitra, Siladittya Manna, Saumik Bhattacharya, Umapada Pal

Figure 1 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 2 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 3 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Figure 4 for Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Abstract:Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Decorrelation-based self-supervised pretraining has shown great promise among non-contrastive algorithms, yielding performance at par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply the same to learning disentangled stroke features for writer identification. Here we propose a modified formulation of the decorrelation-based framework named SWIS which was proposed for signature verification by standardizing the features along each dimension on top of the existing framework. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods as well. To the best of our knowledge, this work is the first of its kind to apply self-supervised learning for learning representations for writer verification tasks.

Via

Access Paper or Ask Questions

INN-PAR: Invertible Neural Network for PPG to ABP Reconstruction

Sep 13, 2024

Soumitra Kundu, Gargi Panda, Saumik Bhattacharya, Aurobinda Routray, Rajlakshmi Guha

Abstract:Non-invasive and continuous blood pressure (BP) monitoring is essential for the early prevention of many cardiovascular diseases. Estimating arterial blood pressure (ABP) from photoplethysmography (PPG) has emerged as a promising solution. However, existing deep learning approaches for PPG-to-ABP reconstruction (PAR) encounter certain information loss, impacting the precision of the reconstructed signal. To overcome this limitation, we introduce an invertible neural network for PPG to ABP reconstruction (INN-PAR), which employs a series of invertible blocks to jointly learn the mapping between PPG and its gradient with the ABP signal and its gradient. INN-PAR efficiently captures both forward and inverse mappings simultaneously, thereby preventing information loss. By integrating signal gradients into the learning process, INN-PAR enhances the network's ability to capture essential high-frequency details, leading to more accurate signal reconstruction. Moreover, we propose a multi-scale convolution module (MSCM) within the invertible block, enabling the model to learn features across multiple scales effectively. We have experimented on two benchmark datasets, which show that INN-PAR significantly outperforms the state-of-the-art methods in both waveform reconstruction and BP measurement accuracy.

Via

Access Paper or Ask Questions

SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

Sep 02, 2024

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

Figure 1 for SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

Figure 2 for SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

Figure 3 for SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

Figure 4 for SINET: Sparsity-driven Interpretable Neural Network for Underwater Image Enhancement

Abstract:Improving the quality of underwater images is essential for advancing marine research and technology. This work introduces a sparsity-driven interpretable neural network (SINET) for the underwater image enhancement (UIE) task. Unlike pure deep learning methods, our network architecture is based on a novel channel-specific convolutional sparse coding (CCSC) model, ensuring good interpretability of the underlying image enhancement process. The key feature of SINET is that it estimates the salient features from the three color channels using three sparse feature estimation blocks (SFEBs). The architecture of SFEB is designed by unrolling an iterative algorithm for solving the $\ell_1$ regulaized convolutional sparse coding (CSC) problem. Our experiments show that SINET surpasses state-of-the-art PSNR value by $1.05$ dB with $3873$ times lower computational complexity.

Via

Access Paper or Ask Questions

FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Aug 27, 2024

Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós, Saumik Bhattacharya

Figure 1 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 2 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 3 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Figure 4 for FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Abstract:The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.

* Accepted in ICPR 2024

Via

Access Paper or Ask Questions