Abstract: This study evaluates metrics for tasks such as classification, regression, clustering, correlation analysis, statistical tests, segmentation, and image-to-image (I2I) translation. Metrics were compared across Python libraries, R packages, and MATLAB functions to assess their consistency and highlight discrepancies. The findings underscore the need for a unified roadmap to standardize metrics, ensuring reliable and reproducible ML evaluations across platforms. This study examined a wide range of evaluation metrics across various tasks and found only some to be consistent across platforms, namely (i) Accuracy, Balanced Accuracy, Cohen's Kappa, F-beta Score, MCC, Geometric Mean, AUC, and Log Loss in binary classification; (ii) Accuracy, Cohen's Kappa, and F-beta Score in multi-class classification; (iii) MAE, MSE, RMSE, MAPE, Explained Variance, Median AE, MSLE, and Huber in regression; (iv) Davies-Bouldin Index and Calinski-Harabasz Index in clustering; (v) Pearson, Spearman, Kendall's Tau, Mutual Information, Distance Correlation, Percbend, Shepherd, and Partial Correlation in correlation analysis; (vi) Paired t-test, Chi-Square Test, ANOVA, Kruskal-Wallis Test, Shapiro-Wilk Test, Welch's t-test, and Bartlett's test in statistical tests; (vii) Accuracy, Precision, and Recall in 2D segmentation; (viii) Accuracy in 3D segmentation; (ix) MAE, MSE, RMSE, and R-Squared in 2D-I2I translation; and (x) MAE, MSE, and RMSE in 3D-I2I translation. Given the observed discrepancies in a number of metrics (e.g., precision, recall, and F1 score in binary classification; WCSS in clustering; multiple statistical tests; and IoU in segmentation, among others), this study concludes that ML evaluation metrics require standardization and recommends that future research use consistent metrics for different tasks to effectively compare ML techniques and solutions.
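One common source of such cross-platform discrepancies is the averaging convention and positive-class label that each library defaults to. The following is a minimal sketch, assuming scikit-learn and hypothetical toy labels, of how precision, recall, and F1 shift under different averaging choices:

```python
# Hedged sketch (assumes scikit-learn; data are hypothetical): the same
# predictions yield different precision/recall/F1 depending on the
# averaging convention, one of the defaults that varies across libraries.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

for avg in ("binary", "macro", "micro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```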
Abstract: Purpose: Artificial intelligence (AI) techniques have been extensively utilized for the diagnosis and prognosis of several diseases in recent years. This study identifies, appraises, and synthesizes published studies on the use of AI for the prognosis of COVID-19. Method: An electronic search was performed using Medline, Google Scholar, Scopus, Embase, Cochrane, and ProQuest. Studies that examined machine learning or deep learning methods to determine the prognosis of COVID-19 using CT or chest X-ray (CXR) images were included. Pooled sensitivity, specificity, area under the curve, and diagnostic odds ratio were calculated. Result: A total of 36 articles were included; various prognosis-related issues, including disease severity, mechanical ventilation or admission to the intensive care unit, and mortality, were investigated. Several AI models and architectures were employed, such as the Siamese model, support vector machines, random forests, eXtreme Gradient Boosting, and convolutional neural networks. The models achieved sensitivities of 71%, 88%, and 67% for mortality, severity assessment, and need for ventilation, respectively, with corresponding specificities of 69%, 89%, and 89%. Conclusion: Based on the included articles, machine learning and deep learning methods used for the prognosis of COVID-19 patients using radiomic features from CT or CXR images can help clinicians manage patients and allocate resources more effectively. These studies also demonstrate that combining patient demographics, clinical data, laboratory tests, and radiomic features improves model performance.
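For readers unfamiliar with the diagnostic odds ratio (DOR), the sketch below shows the standard formula DOR = [sens/(1-sens)] * [spec/(1-spec)]; it reuses the pooled mortality estimates quoted above purely as a worked example, whereas the study's actual pooling would be computed from the full 2x2 counts of each article:

```python
# Hedged worked example: diagnostic odds ratio from pooled sensitivity
# and specificity. Equivalent to LR+ / LR-.
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    positive_lr = sensitivity / (1.0 - specificity)   # LR+
    negative_lr = (1.0 - sensitivity) / specificity   # LR-
    return positive_lr / negative_lr

# Mortality estimates from the abstract: sensitivity 71%, specificity 69%.
print(f"DOR (mortality): {diagnostic_odds_ratio(0.71, 0.69):.2f}")  # ~5.45
```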
Abstract: Prostate-specific membrane antigen (PSMA) positron emission tomography/computed tomography (PET/CT) imaging provides an exciting frontier in the visualization of prostate cancer (PCa) metastatic lesions. However, accurate segmentation of metastatic lesions is challenging due to low signal-to-noise ratios and the variable sizes, shapes, and locations of the lesions. This study proposes a novel approach for automated segmentation of metastatic lesions in PSMA PET/CT 3D volumetric images using 2D denoising diffusion probabilistic models (DDPMs). Instead of 2D trans-axial slices or 3D volumes, the proposed approach segments the lesions on generated multi-angle maximum intensity projections (MA-MIPs) of the PSMA PET images, then obtains the final 3D segmentation masks from 3D ordered-subset expectation maximization (OSEM) reconstruction of the 2D MA-MIP segmentations. Our proposed method achieved superior performance compared to state-of-the-art 3D segmentation approaches in terms of accuracy and robustness in detecting and segmenting small metastatic PCa lesions. The proposed method has significant potential as a tool for quantitative analysis of metastatic burden in PCa patients.
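To make the MA-MIP input concrete, here is a minimal sketch (assuming NumPy/SciPy; the angle count, array layout, and interpolation order are illustrative choices, not the paper's settings) of computing multi-angle maximum intensity projections by rotating a PET volume about its axial axis:

```python
# Hedged sketch: multi-angle MIPs of a 3D PET volume. Rotate in the
# in-plane (y, x) axes, then take the max along y for each angle.
import numpy as np
from scipy.ndimage import rotate

def ma_mips(volume: np.ndarray, n_angles: int = 16) -> np.ndarray:
    """volume: (z, y, x) PET array -> (n_angles, z, x) stack of MIPs."""
    mips = []
    for angle in np.linspace(0.0, 180.0, n_angles, endpoint=False):
        rotated = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
        mips.append(rotated.max(axis=1))  # project along y
    return np.stack(mips)

pet = np.random.rand(64, 128, 128).astype(np.float32)  # placeholder volume
print(ma_mips(pet, n_angles=8).shape)  # (8, 64, 128)
```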
Abstract: The objective of this study was to develop an automated pipeline that enhances thyroid disease classification using thyroid scintigraphy images, aiming to decrease assessment time and increase diagnostic accuracy. Anterior thyroid scintigraphy images from 2,643 patients were collected, categorized into diffuse goiter (DG), multinodular goiter (MNG), and thyroiditis (TH) based on clinical reports, and then segmented by an expert. A ResUNet model was trained to perform auto-segmentation. Radiomic features were extracted from both physician segmentations (scenario 1) and ResUNet segmentations (scenario 2), followed by removal of highly correlated features using Spearman's correlation and feature selection using Recursive Feature Elimination (RFE) with XGBoost as the core estimator. All models were trained under a leave-one-center-out cross-validation (LOCOCV) scheme, in which nine instances of the algorithms were iteratively trained and validated on data from eight centers and tested on the ninth, for both scenarios separately. Segmentation performance was assessed using the Dice similarity coefficient (DSC), while classification performance was assessed using metrics such as precision, recall, F1-score, accuracy, area under the receiver operating characteristic curve (ROC AUC), and area under the precision-recall curve (PRC AUC). ResUNet achieved DSC values of 0.84$\pm$0.03, 0.71$\pm$0.06, and 0.86$\pm$0.02 for MNG, TH, and DG, respectively. Classification in scenario 1 achieved an accuracy of 0.76$\pm$0.04 and a ROC AUC of 0.92$\pm$0.02, while in scenario 2 it yielded an accuracy of 0.74$\pm$0.05 and a ROC AUC of 0.90$\pm$0.02. The automated pipeline demonstrated performance comparable to physician segmentations on several classification metrics across the different classes, effectively reducing assessment time while maintaining high diagnostic accuracy. Code available at: https://github.com/ahxmeds/thyroidiomics.git.
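The feature-selection stage described above follows a standard two-step recipe; the sketch below (assuming scikit-learn, SciPy, and xgboost; the 0.9 correlation threshold, feature count, and placeholder data are illustrative, not the study's reported settings) drops one feature of each highly Spearman-correlated pair and then runs RFE with an XGBoost core:

```python
# Hedged sketch: Spearman-based de-correlation followed by RFE(XGBoost).
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    rho = np.abs(spearmanr(X.values).correlation)
    upper = np.triu(rho, k=1)  # compare each feature with earlier ones only
    keep = [c for i, c in enumerate(X.columns)
            if not (upper[:, i] > threshold).any()]
    return X[keep]

X = pd.DataFrame(np.random.rand(200, 50),
                 columns=[f"radiomic_{i}" for i in range(50)])  # placeholder
y = np.random.randint(0, 3, size=200)  # placeholder DG / MNG / TH labels

X_reduced = drop_correlated(X)
selector = RFE(XGBClassifier(eval_metric="mlogloss"), n_features_to_select=10)
selector.fit(X_reduced, y)
print(list(X_reduced.columns[selector.support_]))
```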
Abstract: Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques have been changing rapidly with the introduction of deep learning techniques that replace traditional estimation methods. This, in turn, requires adapting optimal experimental design to these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that training a network as a likelihood-free estimator can significantly simplify the design process and circumvent the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, this deep design approach improves the quality of the recovery process for parameter estimation problems. As a proof of concept, we apply our methodology to two different systems of ordinary differential equations.
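To illustrate the likelihood-free estimation idea in the simplest terms, the sketch below (assuming SciPy and scikit-learn; the logistic-growth ODE, noise level, and network are illustrative stand-ins, not the paper's systems or architecture) trains a regressor to map simulated noisy trajectories back to the generating parameter, with no explicit likelihood anywhere:

```python
# Hedged sketch: a likelihood-free estimator for dx/dt = theta*x*(1-x).
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 5.0, 20)

def simulate(theta: float) -> np.ndarray:
    sol = solve_ivp(lambda t, x: theta * x * (1 - x), (0, 5), [0.1],
                    t_eval=t_eval)
    return sol.y[0] + rng.normal(0, 0.01, size=t_eval.size)  # noisy obs.

thetas = rng.uniform(0.5, 2.0, size=2000)          # sample the prior
X = np.stack([simulate(th) for th in thetas])      # simulated designs

estimator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                         random_state=0)
estimator.fit(X, thetas)                           # trajectory -> parameter
print("recovered theta:", estimator.predict(simulate(1.3)[None, :]))
```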
Abstract: The 2nd SNMMI Artificial Intelligence (AI) Summit, organized by the SNMMI AI Task Force, took place in Bethesda, MD, on February 29 - March 1, 2024. Bringing together various community members and stakeholders, and following up on the successful 2022 AI Summit, the summit's theme was AI in Action. The six key topics included (i) an overview of prior and ongoing efforts by the AI Task Force, (ii) emerging needs and tools for computational nuclear oncology, (iii) new frontiers in large language and generative models, (iv) defining the value proposition for the use of AI in nuclear medicine, (v) open science, including efforts for data and model repositories, and (vi) issues of reimbursement and funding. The primary efforts, findings, challenges, and next steps are summarized in this manuscript.
Abstract: Dynamic Positron Emission Tomography (dPET) imaging and Time-Activity Curve (TAC) analyses are essential for understanding and quantifying the biodistribution of radiopharmaceuticals over time and space. Traditional compartmental modeling, while foundational, commonly struggles to fully capture the complexities of biological systems, including non-linear dynamics and variability. This study introduces an innovative data-driven, neural network-based framework, inspired by reaction-diffusion systems, designed to address these limitations. Our approach, which adaptively fits TACs from dPET, enables the direct calibration of diffusion coefficients and reaction terms from observed data, offering significant improvements in predictive accuracy and robustness over traditional methods, especially in complex biological scenarios. By more accurately modeling the spatio-temporal dynamics of radiopharmaceuticals, our method advances the modeling of pharmacokinetic and pharmacodynamic processes, enabling new possibilities in quantitative nuclear medicine.
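The calibration idea can be illustrated with a toy reaction-diffusion model; the sketch below (assuming PyTorch; the 1D grid, explicit scheme, linear reaction term, and synthetic "TACs" are all illustrative assumptions, not the paper's model) fits a diffusion coefficient D and reaction rate k directly to observed curves by gradient descent:

```python
# Hedged sketch: calibrate D and k in u_t = D*u_xx + k*u against
# observed time-activity curves, using a differentiable rollout.
import torch

n_x, n_t, dx, dt = 32, 50, 1.0, 0.1
u0 = torch.exp(-0.5 * ((torch.arange(n_x, dtype=torch.float32) - 16) / 3) ** 2)

def rollout(D, k):
    u, frames = u0.clone(), []
    for _ in range(n_t):
        lap = (torch.roll(u, 1) - 2 * u + torch.roll(u, -1)) / dx ** 2
        u = u + dt * (D * lap + k * u)   # explicit Euler step
        frames.append(u)
    return torch.stack(frames)           # (n_t, n_x): a TAC per voxel

# Synthetic "observed" data from known parameters D=0.8, k=-0.05.
observed = rollout(torch.tensor(0.8), torch.tensor(-0.05)).detach()

D = torch.tensor(0.1, requires_grad=True)
k = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([D, k], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = torch.mean((rollout(D, k) - observed) ** 2)
    loss.backward()
    opt.step()
print(f"D={D.item():.3f}, k={k.item():.3f}")  # should approach 0.8, -0.05
```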
Abstract: We introduce an innovative, simple, and effective segmentation-free approach for outcome prediction in head and neck cancer (HNC) patients. By harnessing deep learning-based feature extraction techniques and multi-angle maximum intensity projections (MA-MIPs) applied to Fluorodeoxyglucose Positron Emission Tomography (FDG-PET) volumes, our proposed method eliminates the need for manual segmentation of regions of interest (ROIs) such as primary tumors and involved lymph nodes. Instead, a state-of-the-art object detection model is trained to automatically crop the head and neck region on the PET volumes. A pre-trained deep convolutional neural network backbone is then used to extract deep features from MA-MIPs obtained from 72 multi-angle axial rotations of the cropped PET volumes. The deep features extracted from the multiple projection views are aggregated and fused, and employed to perform recurrence-free survival analysis on a cohort of 489 HNC patients. The proposed approach outperforms the best-performing method on the target dataset for the task of recurrence-free survival analysis. By circumventing manual delineation of the malignancies on the FDG PET-CT images, our approach eliminates the dependency on subjective interpretations and greatly enhances the reproducibility of the proposed survival analysis method.
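The extract-and-fuse step can be sketched as follows (assuming PyTorch/torchvision; the ResNet-18 backbone, mean-pool fusion, and placeholder views are illustrative choices, and the downstream survival model is omitted): each MIP view is passed through a pretrained backbone and the per-view features are averaged into one patient embedding.

```python
# Hedged sketch: deep features from MA-MIP views via a pretrained CNN,
# fused by mean pooling across views.
import torch
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d penultimate features
backbone.eval()

mips = torch.rand(72, 1, 224, 224)  # placeholder: 72 rotated MIP views
with torch.no_grad():
    feats = backbone(mips.repeat(1, 3, 1, 1))  # grayscale -> 3 channels
patient_embedding = feats.mean(dim=0)          # fuse across views
print(patient_embedding.shape)                 # torch.Size([512])
```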
Abstract: Minimizing the need for pixel-level annotated data to train PET anomaly segmentation networks is crucial, particularly given the time and cost constraints of expert annotations. Current un-/weakly-supervised anomaly detection methods rely on autoencoders or generative adversarial networks trained only on healthy data, although these models are more challenging to train. In this work, we present a weakly supervised and Implicitly guided COuNterfactual diffusion model for Detecting Anomalies in PET images, branded as IgCONDA-PET. The training is conditioned on image class labels (healthy vs. unhealthy) along with implicit guidance to generate counterfactuals for an unhealthy image with anomalies. The counterfactual generation process synthesizes the healthy counterpart of a given unhealthy image, and the difference between the two facilitates the identification of anomaly locations. The code is available at: https://github.com/igcondapet/IgCONDA-PET.git
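The final localization step is simple enough to sketch directly (assuming NumPy; the diffusion sampler itself is omitted, and the counterfactual here is a placeholder, not the model's guided output): the anomaly map is the normalized voxelwise difference between the unhealthy input and its generated healthy counterpart.

```python
# Hedged sketch: anomaly map as |unhealthy - counterfactual|, normalized.
import numpy as np

def anomaly_map(unhealthy: np.ndarray, counterfactual: np.ndarray) -> np.ndarray:
    diff = np.abs(unhealthy - counterfactual)
    return (diff - diff.min()) / (np.ptp(diff) + 1e-8)  # scale to [0, 1]

unhealthy = np.random.rand(128, 128)   # placeholder PET slice
counterfactual = unhealthy * 0.8       # placeholder "healthy" generation
mask = anomaly_map(unhealthy, counterfactual) > 0.5
print(mask.sum(), "voxels flagged")
```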
Abstract: The effectiveness of Deep Neural Networks (DNNs) heavily relies on the abundance and accuracy of available training data. However, collecting and annotating data on a large scale is often both costly and time-intensive, particularly in medical settings where practitioners are already occupied with their duties. Moreover, ensuring that the model remains robust across various image-capture scenarios is crucial in medical domains, especially for ultrasound images, which vary with the settings of different devices and the manual operation of the transducer. To address this challenge, we introduce a novel pipeline called MEDDAP, which leverages Stable Diffusion (SD) models to augment existing small datasets by automatically generating new, informative, labeled samples. Pretrained SD checkpoints are typically based on natural images, and training them for medical images requires significant GPU resources due to their large number of parameters. To overcome this challenge, we introduce USLoRA (Ultrasound Low-Rank Adaptation), a novel fine-tuning method tailored specifically for ultrasound applications. USLoRA allows selective fine-tuning of weights within SD, requiring fewer than 0.1% of the parameters needed to fully fine-tune even the UNet portion of SD alone. To enhance dataset diversity, we incorporate different adjectives into the generation prompts, thereby desensitizing the classifiers to intensity changes across different images. This approach is inspired by clinicians' decision-making processes regarding breast tumors, where tumor shape often plays a more crucial role than intensity. In conclusion, our pipeline not only outperforms classifiers trained on the original dataset but also demonstrates superior performance on unseen datasets. The source code is available at https://github.com/yasamin-med/MEDDAP.
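The parameter savings come from the low-rank adaptation idea; the sketch below (assuming PyTorch; this is a generic LoRA layer, not the paper's USLoRA implementation, and the layer sizes are hypothetical) freezes a pretrained linear weight W and trains only a small update (alpha/r) * B @ A, which is how fine-tuning can touch well under 1% of the weights:

```python
# Hedged sketch: a minimal LoRA-wrapped linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))        # hypothetical SD sub-layer size
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4f}")  # ~0.01 for r=4
```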