Malaikannan Sankarasubbu

ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Jun 04, 2025

Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar

Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA

Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations

Feb 10, 2024

Ankit Pal, Malaikannan Sankarasubbu

Abstract: Large language models have the potential to be valuable in the healthcare industry, but it's crucial to verify their safety and effectiveness through rigorous evaluation. For this purpose, we comprehensively evaluated both open-source LLMs and Google's new multimodal LLM called Gemini across medical reasoning, hallucination detection, and medical Visual Question Answering tasks. While Gemini showed competence, it lagged behind state-of-the-art models like MedPaLM 2 and GPT-4 in diagnostic accuracy. Additionally, Gemini achieved an accuracy of 61.45% on the medical VQA dataset, significantly lower than GPT-4V's score of 88%. Our analysis revealed that Gemini is highly susceptible to hallucinations, overconfidence, and knowledge gaps, which indicate risks if deployed uncritically. We also performed a detailed analysis by medical subject and test type, providing actionable feedback for developers and clinicians. To mitigate risks, we applied prompting strategies that improved performance. Additionally, we facilitated future research and development by releasing a Python module for medical LLM evaluation and establishing a dedicated leaderboard on Hugging Face for medical domain LLMs. The Python module can be found at https://github.com/promptslab/RosettaEval

* Preprint version, Under Review

Med-HALT: Medical Domain Hallucination Test for Large Language Models

Jul 28, 2023

Logesh Kumar Umapathi, Ankit Pal, Malaikannan Sankarasubbu

Abstract: This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning-based and memory-based hallucination tests, designed to assess LLMs' problem-solving and information retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at medhalt.github.io

Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges

Nov 19, 2022

Madhura Joshi, Ankit Pal, Malaikannan Sankarasubbu

Abstract: Federated learning is the process of developing machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices while preventing data leakage. This survey examines previous research and studies on federated learning in the healthcare sector across a range of use cases and applications. Our survey shows which challenges, methods, and applications a practitioner should be aware of in federated learning. This paper aims to lay out existing research and list the possibilities of federated learning for healthcare industries.

* ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 40. Publication date: October 2022

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

Mar 27, 2022

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu

Abstract: This paper introduces MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options, which requires deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.

* ACM Conference on Health, Inference, and Learning (CHIL) 2022
* Proceedings of Machine Learning Research (PMLR), ACM Conference on Health, Inference, and Learning (CHIL) 2022

Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing

Sep 23, 2021

Kamal Raj Kanakarajan, Bhuvana Kundumani, Malaikannan Sankarasubbu

Abstract: Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models which can be finetuned for specific tasks. These large models with billions of parameters trained on numerous GPUs/TPUs over weeks are leading in the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost- and time-effective smaller models trained on a single GPU. This will enable researchers with resource constraints to experiment with novel and innovative ideas on tokenization, pretraining tasks, architecture, fine-tuning methods, etc. We set up Small-Bench NLP, a benchmark for small efficient neural language models trained on a single GPU. The Small-Bench NLP benchmark comprises eight NLP tasks on the publicly available GLUE datasets and a leaderboard to track the progress of the community. Our ELECTRA-DeBERTa (15M parameters) small model architecture achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters). Our models, code and leaderboard are available at https://github.com/smallbenchnlp

Pay Attention to the cough: Early Diagnosis of COVID-19 using Interpretable Symptoms Embeddings with Cough Sound Signal Processing

Oct 12, 2020

Ankit Pal, Malaikannan Sankarasubbu

Abstract: The COVID-19 (coronavirus disease 2019) pandemic caused by SARS-CoV-2 has led to a treacherous and devastating catastrophe for humanity. At the time of writing, no specific antiviral drugs or vaccines are recommended to control infection transmission and spread. The current diagnosis of COVID-19 is done by Reverse-Transcription Polymerase Chain Reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and not easily available in straitened regions. An interpretable COVID-19 diagnosis AI framework is devised and developed based on cough sound features and symptoms metadata to overcome these limitations. The proposed framework's performance was evaluated using a medical dataset containing Symptoms and Demographic data of 30000 audio segments, 328 cough sounds from 150 patients with four cough classes (COVID-19, Asthma, Bronchitis, and Healthy). Experimental results show that the model captures better and more robust feature embeddings to distinguish between COVID-19 patient coughs and several types of non-COVID-19 coughs, with higher specificity and accuracy of 95.04 ± 0.18% and 96.83 ± 0.18% respectively, all the while maintaining interpretability.

* Preprint Version

Multi-Label Text Classification using Attention-based Graph Neural Network

Mar 22, 2020

Ankit Pal, Muru Selvakumar, Malaikannan Sankarasubbu

Abstract: In Multi-Label Text Classification (MLTC), one sample can belong to more than one class. It is observed that in most MLTC tasks there are dependencies or correlations among labels. Existing methods tend to ignore the relationship among labels. In this paper, a graph attention network-based model is proposed to capture the attentive dependency structure among the labels. The graph attention network uses a feature matrix and a correlation matrix to capture and explore the crucial dependencies between the labels and generate classifiers for the task. The generated classifiers are applied to sentence feature vectors obtained from the text feature extraction network (BiLSTM) to enable end-to-end training. Attention allows the system to assign different weights to neighbor nodes per label, thus allowing it to learn the dependencies among labels implicitly. The results of the proposed model are validated on five real-world MLTC datasets. The proposed model achieves similar or better performance compared to the previous state-of-the-art models.

* 12th International Conference on Agents and Artificial Intelligence (ICAART 2020)

Detecting Parking Spaces in a Parcel using Satellite Images

Aug 28, 2019

Murugesan Vadivel, SelvaKumar Murugan, Vaidheeswaran Archana, Malaikannan Sankarasubbu

Abstract: Remote sensing images from satellites have been used in various domains for detecting and understanding structures on the ground surface. In this work, satellite images were used for localizing parking spaces and vehicles in parking lots for a given parcel using RCNN-based neural network architectures. Parcel shapefiles and raster images from the USGS image archive were used for developing images for both training and testing. A Feature Pyramid-based Mask RCNN yields an average class accuracy of 97.56% for both parking spaces and vehicles.

Compositional Attention Networks for Interpretability in Natural Language Question Answering

Oct 30, 2018

Muru Selvakumar, Suriyadeepan Ramamoorthy, Vaidheeswaran Archana, Malaikannan Sankarasubbu

Abstract: MAC net is a compositional attention network designed for Visual Question Answering. We propose a modified MAC net architecture for Natural Language Question Answering. Question Answering typically requires Language Understanding and multi-step Reasoning. MAC net's unique architecture, the separation between memory and control, facilitates data-driven iterative reasoning. This makes it an ideal candidate for solving tasks that involve logical reasoning. Our experiments with 20 bAbI tasks demonstrate the value of MAC net as a data-efficient and interpretable architecture for Natural Language Question Answering. The transparent nature of MAC net provides a highly granular view of the reasoning steps taken by the network in answering a query.

* 8 pages, 10 figures, 1 table
