Sudipta Roy

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Jun 18, 2025

Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy, Muhammad Awais, Muhammad Haris Khan, Imran Razzak

Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations, which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4× speedup compared to existing value-guided methods. Specifically, we show that ViMaR, trained solely on LLaVA Mistral-7B, generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
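The two-stage idea in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `segment_score` stands in for the paper's temporal-difference value model, `refine` stands in for re-generation of a weak segment, and the linear form of the margin penalty is hypothetical. The abstract does not give the actual formulas.

```python
from typing import Callable, List

def margin_adjusted_reward(score: float, margin: float, penalty: float) -> float:
    """Leave confident scores unchanged; penalize scores below the margin.

    The linear penalty is an assumed form, not the paper's definition.
    """
    if score >= margin:
        return score
    return score - penalty * (margin - score)

def two_stage_caption(
    candidates: List[List[str]],
    segment_score: Callable[[str], float],  # hypothetical stand-in for a value model
    refine: Callable[[str], str],           # hypothetical stand-in for re-generation
    margin: float = 0.5,
    penalty: float = 1.0,
) -> List[str]:
    """Pick the best candidate once, then refine only its weak segments."""
    def caption_value(segments: List[str]) -> float:
        adjusted = [margin_adjusted_reward(segment_score(s), margin, penalty)
                    for s in segments]
        return sum(adjusted) / len(adjusted)

    # Stage 1: a single pass over the candidates selects the highest-value caption.
    best = max(candidates, key=caption_value)
    # Stage 2: only segments scoring below the margin are refined.
    return [s if segment_score(s) >= margin else refine(s) for s in best]
```

Because stage 2 touches only the below-margin segments of one caption, the number of scoring calls stays small compared with re-scoring every candidate at every step, which is the kind of saving the abstract attributes to the method.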

Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology

Mar 21, 2025

Devavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra, Sudipta Roy, Jean-Philippe Thiran, Behzad Bozorgtabar

Abstract: In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.

* Accepted to ISBI 2025

Multimodal Fusion Learning with Dual Attention for Medical Imaging

Dec 02, 2024

Joy Dhar, Nayyar Zaidi, Maryam Haghighat, Puneet Goyal, Sudipta Roy, Azadeh Alavi, Vikas Kumar

Abstract: Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations. First, they often lack generalizability to other diagnosis tasks due to their focus on a particular disease. Second, they do not fully leverage multiple health records from diverse modalities to learn robust complementary information. Finally, they typically rely on a single attention mechanism, missing the benefits of multiple attention strategies within and across various modalities. To address these issues, this paper proposes a dual robust information fusion attention mechanism (DRIFA) that leverages two attention modules, i.e., the multi-branch fusion attention module and the multimodal information fusion attention module. DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework denoted as DRIFA-Net. We show that the multi-branch fusion attention of DRIFA learns enhanced representations for each modality, such as dermoscopy, pap smear, MRI, and CT scan, whereas the multimodal information fusion attention module learns more refined multimodal shared representations, improving the network's generalization across multiple tasks and enhancing overall performance. Additionally, to estimate the uncertainty of DRIFA-Net predictions, we employ an ensemble Monte Carlo dropout strategy. Extensive experiments on five publicly available datasets with diverse modalities demonstrate that our approach consistently outperforms state-of-the-art methods. The code is available at https://github.com/misti1203/DRIFA-Net.

* IEEE/CVF Winter Conference on Applications of Computer Vision WACV 2025
* 10 pages
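The uncertainty estimation mentioned in the abstract above relies on Monte Carlo dropout. The following is a minimal sketch of plain Monte Carlo dropout on a toy linear model, not DRIFA-Net's ensemble variant. The function names, the toy model, and the default parameters are all assumptions for illustration.

```python
import random
import statistics
from typing import List, Tuple

def dropout_forward(weights: List[float], inputs: List[float],
                    drop_rate: float, rng: random.Random) -> float:
    """One stochastic forward pass of a toy linear model with inverted dropout."""
    keep = 1.0 - drop_rate
    total = 0.0
    for w, x in zip(weights, inputs):
        if rng.random() >= drop_rate:    # keep this unit
            total += (w / keep) * x      # rescale so the expected output is unchanged
    return total

def mc_dropout_predict(weights: List[float], inputs: List[float],
                       drop_rate: float = 0.5, passes: int = 100,
                       seed: int = 0) -> Tuple[float, float]:
    """Return (mean prediction, standard deviation) over repeated stochastic passes."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    samples = [dropout_forward(weights, inputs, drop_rate, rng)
               for _ in range(passes)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

The spread of the sampled predictions serves as the uncertainty estimate: with dropout disabled every pass agrees and the spread is zero, while a larger spread signals a less certain prediction.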

Autonomous on-Demand Shuttles for First Mile-Last Mile Connectivity: Design, Optimization, and Impact Assessment

Aug 15, 2024

Sudipta Roy, Gabriel Dadashev, Lampros Yfantis, Bat-hen Nahmias-Biran, Samiul Hasan

Abstract: First-Mile Last-Mile (FMLM) connectivity is crucial for improving public transit accessibility and efficiency, particularly in sprawling suburban regions where traditional fixed-route transit systems are often inadequate. Autonomous on-Demand Shuttles (AODS) offer a promising option for FMLM connections due to their cost-effectiveness and improved safety features, thereby enhancing user convenience and reducing reliance on personal vehicles. A critical issue in AODS service design is the optimization of travel paths, for which realistic traffic network assignment combined with optimal routing offers a viable solution. In this study, we have designed an AODS controller that integrates a mesoscopic simulation-based dynamic traffic assignment model with a greedy insertion heuristic to optimize the travel routes of the shuttles. The controller also considers the charging infrastructure and strategies and the impact of the shuttles on regular traffic flow for route and fleet-size planning. The controller is implemented in the Aimsun traffic simulator, considering Lake Nona in Orlando, Florida as a case study. We show that, under the present demand based on 1% of total trips as transit riders, a fleet of 3 autonomous shuttles can serve about 80% of FMLM trip requests on an on-demand basis with an average waiting time below 4 minutes. Additional power sources have a significant effect on service quality, as the inactive waiting time for charging would increase the fleet size. We also show that low-speed autonomous shuttles would have negligible impact on regular vehicle flow, making them suitable for suburban areas. These findings have important implications for sustainable urban planning and public transit operations.

* 25 Pages, 13 Figures, 1 Table
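The greedy insertion heuristic named in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: stops are 2D points, cost is Euclidean route length rather than simulated travel time from a dynamic traffic assignment model, and the function names are hypothetical. Charging, vehicle capacity, and waiting-time constraints are ignored.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def route_length(route: List[Point]) -> float:
    """Total Euclidean length of a route visited in order."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))

def greedy_insert(route: List[Point], pickup: Point,
                  dropoff: Point) -> Tuple[List[Point], float]:
    """Insert a request into an existing route at the cheapest positions.

    route[0] is the shuttle's current position and stays first.
    The pickup is always placed before the dropoff.
    """
    best_route: List[Point] = []
    best_cost = math.inf
    for i in range(1, len(route) + 1):            # candidate pickup positions
        with_pickup = route[:i] + [pickup] + route[i:]
        for j in range(i + 1, len(route) + 2):    # dropoff strictly after pickup
            candidate = with_pickup[:j] + [dropoff] + with_pickup[j:]
            cost = route_length(candidate)
            if cost < best_cost:
                best_route, best_cost = candidate, cost
    return best_route, best_cost
```

In a fleet setting, one plausible extension is to call `greedy_insert` for each shuttle and assign the request to the shuttle with the smallest increase in cost. The heuristic is greedy because earlier insertions are never revisited when later requests arrive.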

Bengali Handwritten Character Classification using Transfer Learning on Deep Convolutional Neural Network

Feb 25, 2019

Swagato Chatterjee, Rwik Kumar Dutta, Debayan Ganguly, Kingshuk Chatterjee, Sudipta Roy

Abstract: In this paper, we propose a solution that uses state-of-the-art techniques in Deep Learning to tackle the problem of Bengali Handwritten Character Recognition (HCR). Our method uses fewer iterations to train than most other comparable methods. We employ Transfer Learning on ResNet 50, a state-of-the-art deep Convolutional Neural Network model, pretrained on the ImageNet dataset. We also use other techniques, such as a modified version of the One Cycle Policy and varying the input image sizes, to speed up training. We evaluate our technique on the BanglaLekha-Isolated dataset, which consists of 84 classes (50 Basic, 10 Numerals and 24 Compound Characters). We achieve 96.12% accuracy in just 47 epochs on the BanglaLekha-Isolated dataset. Compared with the methods of other researchers, considering the number of classes and without using Ensemble Learning, the proposed solution achieves state-of-the-art results for Handwritten Bengali Character Recognition. Code and weight files are available at https://github.com/swagato-c/bangla-hwcr-present.
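The One Cycle Policy mentioned in the abstract above is a learning-rate schedule that ramps up to a peak and then anneals back down within a single training run. The sketch below shows a basic linear version. It is not the paper's modified variant, which the abstract does not specify, and the parameter names and default values are assumptions.

```python
def one_cycle_lr(step: int, total_steps: int, max_lr: float = 0.01,
                 div_factor: float = 25.0, pct_start: float = 0.3) -> float:
    """Learning rate at a given step of a linear one-cycle schedule.

    The rate rises linearly from max_lr / div_factor to max_lr over the
    first pct_start fraction of training, then falls linearly back.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    min_lr = max_lr / div_factor
    warmup_steps = max(1, int(total_steps * pct_start))
    if step < warmup_steps:
        frac = step / warmup_steps                  # 0 at the start of warmup
        return min_lr + frac * (max_lr - min_lr)
    decay_steps = max(1, total_steps - warmup_steps - 1)
    frac = (step - warmup_steps) / decay_steps      # 1 at the final step
    return max_lr - frac * (max_lr - min_lr)
```

For example, `[one_cycle_lr(s, 100) for s in range(100)]` yields a schedule that starts low, peaks at `max_lr` near step 30, and returns to the starting rate by the final step.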

A Review on Automated Brain Tumor Detection and Segmentation from MRI of Brain

Dec 16, 2013

Sudipta Roy, Sanjay Nag, Indra Kanta Maitra, Samir Kumar Bandyopadhyay

Abstract: Tumor segmentation from magnetic resonance imaging (MRI) data is an important but time-consuming manual task performed by medical experts. Automating this process is challenging because of the high diversity in the appearance of tumor tissues among different patients and, in many cases, their similarity to normal tissues. MRI is an advanced medical imaging technique that provides rich information about human soft-tissue anatomy. Various methods exist to detect and segment a brain tumor from MRI images. These detection and segmentation approaches are reviewed, with emphasis on highlighting their advantages and drawbacks for brain tumor detection and segmentation. The use of MRI image detection and segmentation in different procedures is also described. This paper presents a brief review of different segmentation methods for detecting brain tumors from brain MRI.

* International Journal of Advanced Research in Computer Science and Software Engineering, 2013
* 30 figures. arXiv admin note: text overlap with arXiv:1205.6572 by other authors

Identifications of concealed weapon in a Human Body

Oct 20, 2012

Prof. Samir K. Bandyopadhyay, Biswajita Datta, Sudipta Roy

Abstract: The detection of weapons concealed underneath a person's clothes is very important for improving public security as well as the safety of public assets such as airports, buildings, and railway stations.

* 6 pages, International Journal of Scientific & Engineering Research (ISSN 2229-5518) 2012

A New Local Adaptive Thresholding Technique in Binarization

Jan 25, 2012

T. Romen Singh, Sudipta Roy, O. Imocha Singh, Tejmani Sinam, Kh. Manglem Singh

Abstract: Image binarization is the process of separating pixel values into two groups, white as background and black as foreground. Thresholding plays a major role in the binarization of images. Thresholding can be categorized into global thresholding and local thresholding. In images with a uniform contrast distribution of background and foreground, such as document images, global thresholding is more appropriate. In degraded document images, where considerable background noise or variation in contrast and illumination exists, there exist many pixels that cannot be easily classified as foreground or background. In such cases, binarization with local thresholding is more appropriate. This paper describes a locally adaptive thresholding technique that removes background by using the local mean and mean deviation. Normally, the computation time of the local mean depends on the window size. Our technique uses an integral sum image as a preprocessing step to calculate the local mean. It does not involve calculating standard deviations, as other local adaptive techniques do. This, along with the fact that the calculation of the mean is independent of window size, speeds up the process compared to other local thresholding techniques.

* IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 6, No 2, (2011) 271-277
* ISSN (Online): 1694-0814 http://www.IJCSI.org 271
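The window-size-independent local mean described in the abstract above can be sketched with an integral sum image (summed-area table). The integral-image part follows directly from the abstract. The threshold rule, `T = m * (1 + k * (d / (1 - d) - 1))` with `d = pixel - m`, and the default `k` are assumed here for illustration and should be checked against the paper. Pixel values are assumed to be normalized to [0, 1].

```python
from typing import List

Image = List[List[float]]

def integral_image(img: Image) -> List[List[float]]:
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat

def local_mean(sat: List[List[float]], y: int, x: int,
               radius: int, h: int, w: int) -> float:
    """Mean over the window around (y, x) using four table lookups."""
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
    return total / ((y1 - y0) * (x1 - x0))

def binarize(img: Image, radius: int = 1, k: float = 0.06) -> List[List[int]]:
    """Return 0 (black, foreground) or 1 (white, background) per pixel."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            m = local_mean(sat, y, x, radius, h, w)
            d = img[y][x] - m                            # deviation from local mean
            t = m * (1.0 + k * (d / (1.0 - d) - 1.0))    # assumed threshold form
            row.append(0 if img[y][x] < t else 1)
        out.append(row)
    return out
```

Each call to `local_mean` costs four lookups regardless of `radius`, so enlarging the window does not slow the per-pixel work, and no standard deviation is computed at any point.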
