Noel C. F. Codella

From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

May 16, 2025

Xue Li, Jameson Merkow, Noel C. F. Codella, Alberto Santamaria-Pang, Naiteek Sangani, Alexander Ersoy, Christopher Burt, John W. Garrett, Richard J. Bruce, Joshua D. Warner(+4 more)

Abstract: Foundation models, pretrained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models in multi-class radiography classification, focusing specifically on tube placement assessment. A dataset comprising 8842 radiographs classified into seven distinct categories was employed to extract embeddings using six foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, Rad-DINO, and CXR-Foundation. Adapter models were subsequently trained using classical machine learning algorithms. Among these combinations, MedImageInsight embeddings paired with a support vector machine adapter yielded the highest mean area under the curve (mAUC) at 93.8%, followed closely by Rad-DINO (91.1%) and CXR-Foundation (89.0%). In comparison, BiomedCLIP and DenseNet121 exhibited moderate performance with mAUC scores of 83.0% and 81.8%, respectively, whereas Med-Flamingo delivered the lowest performance at 75.1%. Notably, most adapter models demonstrated computational efficiency, achieving training within one minute and inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, fairness analyses on adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 2% and standard deviations across age groups not exceeding 3%. These findings confirm that foundation model embeddings, especially those from MedImageInsight, facilitate accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.

* 11 pages, 5 figures, 4 tables
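
The mAUC figures quoted in the abstract above are averages of per-class AUCs. The following is a minimal sketch of one way such a number can be computed, assuming one-vs-rest AUCs with an unweighted (macro) average; the abstract does not state which averaging scheme the authors used, and the function names and toy scores here are hypothetical.

```python
from typing import Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: Sequence[int], y_score: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of one-vs-rest AUCs (an assumed averaging scheme)."""
    n_classes = len(y_score[0])
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes


# Toy data: 4 samples, 3 classes, made-up classifier scores.
y_true = [0, 1, 2, 1]
y_score = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
]
print(mean_auc(y_true, y_score))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable for a dataset of several thousand radiographs.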

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

Oct 09, 2024

Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaria-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang(+21 more)

Abstract: In this work, we present MedImageInsight, an open-source medical imaging embedding model. MedImageInsight is trained on medical images with associated text and labels across a diverse collection of domains, including X-Ray, CT, MRI, dermoscopy, OCT, fundus photography, ultrasound, histopathology, and mammography. Rigorous evaluations demonstrate MedImageInsight's ability to achieve state-of-the-art (SOTA) or human expert level performance across classification, image-image search, and fine-tuning tasks. Specifically, on public datasets, MedImageInsight achieves SOTA in CT 3D medical image retrieval, as well as SOTA in disease classification and search for chest X-ray, dermatology, and OCT imaging. Furthermore, MedImageInsight achieves human expert performance in bone age estimation (on both public and partner data), as well as AUC above 0.9 in most other domains. When paired with a text decoder, MedImageInsight achieves near SOTA level single image report findings generation with less than 10% of the parameters of other models. Compared to fine-tuning GPT-4o with only MIMIC-CXR data for the same task, MedImageInsight outperforms in clinical metrics, but underperforms on lexical metrics where GPT-4o sets a new SOTA. Importantly for regulatory purposes, MedImageInsight can generate ROC curves, adjust sensitivity and specificity based on clinical need, and provide evidence-based decision support through image-image search (which can also enable retrieval augmented generation). In an independent clinical evaluation of image-image search in chest X-ray, MedImageInsight outperformed every other publicly available foundation model evaluated by large margins (over 6 points AUC), and significantly outperformed other models in terms of AI fairness (across age and gender). We hope releasing MedImageInsight will help enhance collective progress in medical imaging AI research and development.

A New Benchmark for Evaluation of Cross-Domain Few-Shot Learning

Dec 16, 2019

Yunhui Guo, Noel C. F. Codella, Leonid Karlinsky, John R. Smith, Tajana Rosing, Rogerio Feris

Abstract: Recent progress on few-shot learning has largely relied on annotated data for meta-learning, sampled from the same domain as the novel classes. However, in many applications, collecting data for meta-learning is infeasible or impossible. This leads to the cross-domain few-shot learning problem, where a large domain shift exists between base and novel classes. Although some preliminary investigation of the few-shot methods under domain shift exists, a standard benchmark for cross-domain few-shot learning is not yet established. In this paper, we propose the cross-domain few-shot learning (CD-FSL) benchmark, consisting of images from diverse domains with varying similarity to ImageNet, including crop disease images, satellite images, and medical images. Extensive experiments on the proposed benchmark are performed to compare an array of state-of-art meta-learning and transfer learning approaches, including various forms of single model fine-tuning and ensemble learning. The results demonstrate that current meta-learning methods underperform in relation to simple fine-tuning by 12.8% average accuracy. Accuracy of all methods tends to correlate with dataset similarity to ImageNet. In addition, the relative performance gain with increasing number of shots is greater with transfer methods compared to meta-learning. Finally, we demonstrate that transferring from multiple pretrained models achieves best performance, with accuracy improvements of 14.9% and 1.9% versus the best of meta-learning and single model fine-tuning approaches, respectively. In summary, the proposed benchmark serves as a challenging platform to guide future research on cross-domain few-shot learning due to its spectrum of diversity and coverage.

Estimating Skin Tone and Effects on Classification Performance in Dermatology Datasets

Oct 29, 2019

Newton M. Kinyanjui, Timothy Odonga, Celia Cintas, Noel C. F. Codella, Rameswar Panda, Prasanna Sattigeri, Kush R. Varshney

Abstract: Recent advances in computer vision and deep learning have led to breakthroughs in the development of automated skin image analysis. In particular, skin cancer classification models have achieved performance higher than trained expert dermatologists. However, no attempt has been made to evaluate the consistency in performance of machine learning models across populations with varying skin tones. In this paper, we present an approach to estimate skin tone in benchmark skin disease datasets, and investigate whether model performance is dependent on this measure. Specifically, we use individual typology angle (ITA) to approximate skin tone in dermatology datasets. We look at the distribution of ITA values to better understand skin color representation in two benchmark datasets: 1) the ISIC 2018 Challenge dataset, a collection of dermoscopic images of skin lesions for the detection of skin cancer, and 2) the SD-198 dataset, a collection of clinical images capturing a wide variety of skin diseases. To estimate ITA, we first develop segmentation models to isolate non-diseased areas of skin. We find that the majority of the data in the two datasets have ITA values between 34.5° and 48°, which are associated with lighter skin; this is consistent with under-representation of darker skinned populations in these datasets. We also find no measurable correlation between the performance of machine learning models and ITA values, though more comprehensive data is needed for further validation.

* NeurIPS 2019 Workshop on Fair ML for Health
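
For readers unfamiliar with the individual typology angle used in the abstract above, here is a minimal sketch of its standard definition, ITA = arctan((L* - 50) / b*) × 180/π, computed from CIELAB values. It assumes the L* and b* values of non-diseased skin pixels are already available (the color-space conversion and the segmentation step described by the authors are not shown), and the function names and sample values are hypothetical.

```python
import math


def ita_degrees(l_star: float, b_star: float) -> float:
    """Individual typology angle in degrees from CIELAB L* and b*."""
    if b_star == 0:
        raise ValueError("b* must be non-zero for the arctan form of ITA")
    return math.degrees(math.atan((l_star - 50.0) / b_star))


def in_reported_range(ita: float, low: float = 34.5, high: float = 48.0) -> bool:
    """True if an ITA value falls in the 34.5 to 48 degree band cited in the abstract."""
    return low <= ita <= high


# Made-up example: a skin patch with mean L* = 70 and mean b* = 20.
ita = ita_degrees(70.0, 20.0)
print(round(ita, 1), in_reported_range(ita))
```

Higher ITA values correspond to lighter skin, so a patch with the same b* but a lower L* yields a smaller angle.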

BCN20000: Dermoscopic Lesions in the Wild

Aug 30, 2019

Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C. Halpern, Susana Puig(+1 more)

Abstract: This article summarizes the BCN20000 dataset, composed of 19424 dermoscopic images of skin lesions captured from 2010 to 2016 in the facilities of the Hospital Clínic in Barcelona. With this dataset, we aim to study the problem of unconstrained classification of dermoscopic images of skin cancer, including lesions found in hard-to-diagnose locations (nails and mucosa), large lesions which do not fit in the aperture of the dermoscopy device, and hypo-pigmented lesions. The BCN20000 will be provided to the participants of the ISIC Challenge 2019, where they will be asked to train algorithms to classify dermoscopic images of skin cancer automatically.

* Abstract for BCN20000

Teaching AI to Explain its Decisions Using Embeddings and Multi-Task Learning

Jun 05, 2019

Noel C. F. Codella, Michael Hind, Karthikeyan Natesan Ramamurthy, Murray Campbell, Amit Dhurandhar, Kush R. Varshney, Dennis Wei, Aleksandra Mojsilović

Abstract: Using machine learning in high-stakes applications often requires predictions to be accompanied by explanations comprehensible to the domain user, who has ultimate responsibility for decisions and outcomes. Recently, a new framework for providing explanations, called TED, has been proposed to provide meaningful explanations for predictions. This framework augments training data to include explanations elicited from domain users, in addition to features and labels. This approach ensures that explanations for predictions are tailored to the complexity expectations and domain knowledge of the consumer. In this paper, we build on this foundational work by exploring more sophisticated instantiations of the TED framework and empirically evaluating their effectiveness in two diverse domains, chemical odor and skin cancer prediction. Results demonstrate that meaningful explanations can be reliably taught to machine learning algorithms, and in some cases, improve modeling accuracy.

* presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA. arXiv admin note: substantial text overlap with arXiv:1805.11648

TED: Teaching AI to Explain its Decisions

Nov 12, 2018

Noel C. F. Codella, Michael Hind, Karthikeyan Natesan Ramamurthy, Murray Campbell, Amit Dhurandhar, Kush R. Varshney, Dennis Wei, Aleksandra Mojsilovic

Abstract: Artificial intelligence systems are being increasingly deployed due to their potential to increase the efficiency, scale, consistency, fairness, and accuracy of decisions. However, as many of these systems are opaque in their operation, there is a growing demand for such systems to provide explanations for their decisions. Conventional approaches to this problem attempt to expose or discover the inner workings of a machine learning model with the hope that the resulting explanations will be meaningful to the consumer. In contrast, this paper suggests a new approach to this problem. It introduces a simple, practical framework, called Teaching Explanations for Decisions (TED), that provides meaningful explanations that match the mental model of the consumer. We illustrate the generality and effectiveness of this approach with two different examples, resulting in highly accurate explanations with no loss of prediction accuracy for these two examples.

* This article leverages some content from arXiv:1805.11648

Teaching Meaningful Explanations

Sep 11, 2018

Noel C. F. Codella, Michael Hind, Karthikeyan Natesan Ramamurthy, Murray Campbell, Amit Dhurandhar, Kush R. Varshney, Dennis Wei, Aleksandra Mojsilovic

Abstract: The adoption of machine learning in high-stakes applications such as healthcare and law has lagged in part because predictions are not accompanied by explanations comprehensible to the domain user, who often holds the ultimate responsibility for decisions and outcomes. In this paper, we propose an approach to generate such explanations in which training data is augmented to include, in addition to features and labels, explanations elicited from domain users. A joint model is then learned to produce both labels and explanations from the input features. This simple idea ensures that explanations are tailored to the complexity expectations and domain knowledge of the consumer. Evaluation spans multiple modeling techniques on a game dataset, a (visual) aesthetics dataset, a chemical odor dataset, and a melanoma dataset, showing that our approach is generalizable across domains and algorithms. Results demonstrate that meaningful explanations can be reliably taught to machine learning algorithms, and in some cases, also improve modeling accuracy.

* 9 pages
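
The abstract above describes training on (features, label, explanation) triples and learning a joint model that outputs both label and explanation. One simple way this could be realized, sketched below as an assumption rather than the authors' stated implementation, is to fuse each label and explanation into a single combined class, train any ordinary classifier on it, and split the prediction back apart. The 1-nearest-neighbor classifier, the function names, and the toy data are all hypothetical stand-ins.

```python
import math
from typing import List, Sequence, Tuple

SEPARATOR = "|"  # assumed not to occur inside label strings


def encode(label: str, explanation: str) -> str:
    """Fuse a label and its explanation into one combined class."""
    return f"{label}{SEPARATOR}{explanation}"


def decode(combined: str) -> Tuple[str, str]:
    """Split a combined class back into (label, explanation)."""
    label, explanation = combined.split(SEPARATOR, 1)
    return label, explanation


def fit(features: Sequence[Sequence[float]],
        labels: Sequence[str],
        explanations: Sequence[str]) -> List[Tuple[Sequence[float], str]]:
    """'Train' a 1-NN model by storing features with their combined class."""
    return [(x, encode(y, e)) for x, y, e in zip(features, labels, explanations)]


def predict(model: List[Tuple[Sequence[float], str]],
            x: Sequence[float]) -> Tuple[str, str]:
    """Return (label, explanation) of the nearest stored training example."""
    _, combined = min(model, key=lambda row: math.dist(x, row[0]))
    return decode(combined)


# Toy training triples with placeholder explanation identifiers.
model = fit(
    features=[(0.9, 0.1), (0.8, 0.3), (0.1, 0.9)],
    labels=["positive", "positive", "negative"],
    explanations=["rule_A", "rule_B", "rule_C"],
)
print(predict(model, (0.85, 0.15)))
```

Because the explanation is predicted as part of the class, any off-the-shelf classifier could replace the nearest-neighbor stand-in without changing the encode and decode steps.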

Collaborative Human-AI (CHAI): Evidence-Based Interpretable Melanoma Classification in Dermoscopic Images

Aug 01, 2018

Noel C. F. Codella, Chung-Ching Lin, Allan Halpern, Michael Hind, Rogerio Feris, John R. Smith

Abstract: Automated dermoscopic image analysis has witnessed rapid growth in diagnostic performance. Yet adoption faces resistance, in part, because no evidence is provided to support decisions. In this work, an approach for evidence-based classification is presented. A feature embedding is learned with CNNs, triplet-loss, and global average pooling, and used to classify via kNN search. Evidence is provided as both the discovered neighbors and the localized image regions most relevant to measuring distance between query and neighbors. To ensure that results are relevant in terms of both label accuracy and human visual similarity for any skill level, a novel hierarchical triplet logic is implemented to jointly learn an embedding according to disease labels and non-expert similarity. Results are improved over baselines trained on disease labels alone, as well as standard multiclass loss. Quantitative relevance of results, according to non-expert similarity, as well as localized image regions, are also significantly improved.

* Presented at MICCAI 2018, Workshop on Interpretability of Machine Intelligence in Medical Image Computing (IMIMIC): https://imimic.bitbucket.io
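
The kNN step described in the abstract above, where the retrieved neighbors double as evidence for the decision, can be illustrated with a minimal sketch. It assumes embeddings have already been computed (the CNN, triplet loss, and region localization are not shown), uses Euclidean distance and a plain majority vote as assumptions, and relies on hypothetical image identifiers and two-dimensional toy embeddings.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

GalleryItem = Tuple[str, Sequence[float], str]  # (image_id, embedding, label)


def knn_with_evidence(query: Sequence[float],
                      gallery: List[GalleryItem],
                      k: int = 3) -> Tuple[str, List[GalleryItem]]:
    """Classify by majority vote over the k nearest gallery embeddings.

    Returns the predicted label together with the neighbors themselves,
    which can be shown to a user as supporting evidence.
    """
    ranked = sorted(gallery, key=lambda item: math.dist(query, item[1]))
    neighbors = ranked[:k]
    votes = Counter(label for _, _, label in neighbors)
    predicted = votes.most_common(1)[0][0]
    return predicted, neighbors


# Toy gallery with made-up embeddings.
gallery = [
    ("img_a", (0.0, 0.0), "benign"),
    ("img_b", (0.1, 0.2), "benign"),
    ("img_c", (1.0, 1.0), "melanoma"),
    ("img_d", (0.9, 1.1), "melanoma"),
    ("img_e", (1.2, 0.8), "melanoma"),
]
label, evidence = knn_with_evidence((1.0, 0.9), gallery, k=3)
print(label, [image_id for image_id, _, _ in evidence])
```

Returning the neighbor records rather than only the label is the point of the design: the same search that produces the prediction also produces the examples a clinician can inspect.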

Segmentation of both Diseased and Healthy Skin from Clinical Photographs in a Primary Care Setting

Apr 18, 2018

Noel C. F. Codella, Daren Anderson, Tyler Philips, Anthony Porto, Kevin Massey, Jane Snowdon, Rogerio Feris, John Smith

Abstract: This work presents the first segmentation study of both diseased and healthy skin in standard camera photographs from a clinical environment. Challenges arise from varied lighting conditions, skin types, backgrounds, and pathological states. For this study, 400 clinical photographs (with skin segmentation masks) representing various pathological states of skin are retrospectively collected from a primary care network. 100 images are used for training and fine-tuning, and 300 are used for evaluation. This distribution between training and test partitions is chosen to reflect the difficulty in amassing large quantities of labeled data in this domain. A deep learning approach is used, and 3 public segmentation datasets of healthy skin are collected to study the potential benefits of pre-training. Two variants of U-Net are evaluated: U-Net and Dense Residual U-Net. We find that Dense Residual U-Nets have a 7.8% improvement in Jaccard, compared to classical U-Net architectures (0.55 vs. 0.51 Jaccard), for direct transfer, where fine-tuning data is not utilized. However, U-Net outperforms Dense Residual U-Net for both direct training (0.83 vs. 0.80) and fine-tuning (0.89 vs. 0.88). The stark performance improvement with fine-tuning compared to direct transfer and direct training emphasizes both the need for adequate representative data of diseased skin, and the utility of other publicly available data sources for this task.

* Accepted to IEEE EMBC 2018
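
As a reading aid for the Jaccard figures in the abstract above, the sketch below computes the Jaccard index (intersection over union) of two binary masks and shows that the quoted 7.8% appears to be a relative improvement, since (0.55 - 0.51) / 0.51 ≈ 0.078 using the rounded values. The function names, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration.

```python
from typing import Sequence


def jaccard(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| for two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline` as a fraction."""
    return (new - baseline) / baseline


# Toy 2x3 masks (1 = skin pixel).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(jaccard(pred, truth))

# Direct-transfer scores quoted in the abstract: 0.55 vs. 0.51.
print(round(100 * relative_gain(0.55, 0.51), 1))
```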
