Abstract: Model merging is an effective strategy for combining multiple models to enhance performance, and it is more efficient than ensemble learning because it introduces no extra computation at inference. However, limited research explores whether the merging process can occur within a single model and enhance that model's robustness, which is particularly critical in the medical imaging domain. In this paper, we are the first to propose in-model merging (InMerge), a novel approach that enhances robustness by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during training for classification. We also analytically reveal important characteristics that affect how in-model merging should be performed, serving as an insightful reference for the community. We demonstrate the feasibility and effectiveness of this technique for different CNN architectures on four widely used datasets. The proposed InMerge-trained model surpasses the typically-trained model by a substantial margin. The code will be made public.
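To make the core idea concrete, the following is a minimal PyTorch sketch of merging similar kernels within one convolutional layer; the cosine-similarity measure, the 0.95 threshold, the averaging rule, and the layer chosen in the usage comment are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def in_model_merge(conv_weight: torch.Tensor, sim_threshold: float = 0.95) -> torch.Tensor:
    """Selectively average highly similar kernels inside a single conv layer.

    conv_weight: (out_channels, in_channels, kH, kW) weight of a deep conv layer.
    Illustrative sketch only; the paper's similarity measure, threshold, and
    merge schedule are not specified in the abstract.
    """
    w = conv_weight.detach()
    flat = w.flatten(1)                                  # one row per output kernel
    sims = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
    merged = w.clone()
    n = w.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > sim_threshold:               # kernels i and j are nearly redundant
                avg = 0.5 * (merged[i] + merged[j])
                merged[i], merged[j] = avg, avg.clone()
    return merged

# Hypothetical usage: applied periodically to a deep layer during training.
# with torch.no_grad():
#     layer = model.layer4[0].conv1                      # assumed layer name
#     layer.weight.copy_(in_model_merge(layer.weight))
```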
Abstract: Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite this progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text, the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.
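Since FetalCLIP follows the CLIP paradigm of image-text pre-training, a symmetric contrastive objective like the sketch below is presumably at its core; the loss form, temperature, and encoder outputs shown here are standard CLIP assumptions rather than details taken from the abstract.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the vision and text encoders.
    Generic CLIP-style objective; FetalCLIP's exact loss and encoders are not
    detailed in the abstract.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # pairwise similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```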
Abstract: Recently, test-time adaptation has garnered attention as a method for tuning models without labeled data. The conventional modus operandi for adapting pre-trained vision-language models (VLMs) at test time primarily focuses on tuning learnable prompts; however, this approach overlooks potential distribution shifts in the visual representations themselves. In this work, we address this limitation by introducing Test-Time Noise Tuning (TNT), a novel method for handling unpredictable shifts in the visual space. TNT leverages, for the first time, a noise adaptation strategy that optimizes learnable noise directly in the visual input space, enabling adaptive feature learning from a single test sample. We further introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding distances, ensuring consistent feature representations across views. Combined with scaled logits and confident view selection at inference, TNT substantially enhances VLM generalization and calibration, achieving average gains of +7.38% on natural distribution benchmarks and +0.80% on cross-dataset evaluations over zero-shot CLIP. These improvements lay a strong foundation for adaptive out-of-distribution handling.
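As a rough illustration of the input-space noise idea, here is a sketch that optimizes a single additive noise tensor over augmented views of one test sample by minimizing prediction entropy; the entropy objective, shared noise map, step count, and learning rate are illustrative stand-ins, and TNT's full method additionally aligns inter-view embedding distances and selects confident views.

```python
import torch
import torch.nn.functional as F

def tune_test_time_noise(model, views: torch.Tensor, steps: int = 10, lr: float = 1e-3):
    """Optimize additive input-space noise on augmented views of one test sample.

    views: (num_views, C, H, W) augmentations of a single test image.
    The single shared noise tensor and entropy objective are assumptions for
    illustration, not TNT's exact formulation.
    """
    noise = torch.zeros_like(views[:1], requires_grad=True)   # one learnable noise map
    optimizer = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        logits = model(views + noise)                  # noise broadcast over all views
        probs = F.softmax(logits, dim=-1).mean(0)      # marginal prediction over views
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return noise.detach()
```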
Abstract: Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and the classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset can be found at https://github.com/ahmed-sharshar/RespiroDynamics.git
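The abstract mentions a Multi-Head Attention Layer for combining video features with patient metadata; the snippet below is one plausible way to wire such a fusion, with the metadata acting as the query over per-frame video tokens. The dimensions, the query/key roles, and the single regression head are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse video-derived features with patient metadata via multi-head attention."""

    def __init__(self, video_dim: int = 256, meta_dim: int = 16, heads: int = 4):
        super().__init__()
        self.meta_proj = nn.Linear(meta_dim, video_dim)
        self.attn = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.head = nn.Linear(video_dim, 1)             # e.g., a PEF regression output

    def forward(self, video_tokens: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, video_dim) per-frame features from the SNN/CNN backbone
        # metadata:     (B, meta_dim) patient attributes
        query = self.meta_proj(metadata).unsqueeze(1)   # (B, 1, video_dim)
        fused, _ = self.attn(query, video_tokens, video_tokens)
        return self.head(fused.squeeze(1))
```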
Abstract: Identifying key pathological features in brain MRIs is crucial for the long-term survival of glioma patients. However, manual segmentation is time-consuming, requires expert intervention, and is susceptible to human error. Therefore, significant research has been devoted to developing machine learning methods that can accurately segment tumors in 3D multimodal brain MRI scans. Despite their progress, state-of-the-art models are often limited by the data they are trained on, raising concerns about their reliability when applied to diverse populations that may introduce distribution shifts. Such shifts can stem from lower-quality MRI technology (e.g., in sub-Saharan Africa) or variations in patient demographics (e.g., children). The BraTS-2024 challenge provides a platform to address these issues. This study presents our methodology for segmenting tumors in the BraTS-2024 SSA and Pediatric Tumors tasks using MedNeXt, comprehensive model ensembling, and thorough postprocessing. Our approach demonstrated strong performance on the unseen validation set, achieving an average Dice Similarity Coefficient (DSC) of 0.896 on the BraTS-2024 SSA dataset and an average DSC of 0.830 on the BraTS Pediatric Tumor dataset. Additionally, our method achieved an average Hausdorff Distance (HD95) of 14.682 on the BraTS-2024 SSA dataset and an average HD95 of 37.508 on the BraTS Pediatric dataset. Our GitHub repository can be accessed at https://github.com/python-arch/BioMbz-Optimizing-Brain-Tumor-Segmentation-with-MedNeXt-BraTS-2024-SSA-and-Pediatrics
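As a rough sketch of the ensembling-plus-postprocessing step, the function below averages softmax probability volumes from several trained models and suppresses tiny predicted regions; the uniform averaging and the per-class voxel-count threshold (used here as a simple stand-in for connected-component filtering) are assumptions, not the authors' exact pipeline.

```python
import torch

def ensemble_segmentation(prob_maps, min_voxels: int = 200) -> torch.Tensor:
    """Average per-model softmax volumes and suppress tiny predicted regions.

    prob_maps: list of (C, D, H, W) probability volumes, one per trained model.
    Illustrative mean ensembling; the paper's postprocessing is not detailed
    in the abstract.
    """
    mean_probs = torch.stack(prob_maps).mean(0)       # (C, D, H, W)
    labels = mean_probs.argmax(0)                     # (D, H, W) hard labels
    for cls in labels.unique():
        if cls == 0:
            continue                                  # skip background
        mask = labels == cls
        if mask.sum() < min_voxels:                   # drop implausibly small tumor regions
            labels[mask] = 0
    return labels
```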
Abstract: Weight-averaged model merging has emerged as a powerful approach in deep learning, capable of enhancing model performance without fine-tuning or retraining. However, the underlying mechanisms that explain its effectiveness remain largely unexplored. In this paper, we investigate this technique from three novel perspectives to provide deeper insights into how and why weight-averaged model merging works: (1) we examine the intrinsic patterns captured by the learned model weights, visualizing them on several datasets and showing that these weights often encode structured and interpretable patterns; (2) we compare model-merging strategies based on averaging weights versus averaging features, providing detailed analyses across diverse architectures and datasets; and (3) we explore how parameter magnitude affects the prediction stability of model merging, showing robustness across different parameter scales and revealing how weight averaging acts as a form of regularization. Our findings shed light on the "black box" of weight-averaged model merging, offering valuable insights and practical recommendations that advance the model-merging process.
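To make the weight-averaging versus feature-averaging comparison concrete, here is a minimal sketch of both operations for models sharing one architecture; uniform averaging over named parameters (ignoring buffers such as batch-norm statistics) is an assumption made for brevity.

```python
import copy
import torch

def average_weights(models):
    """Weight averaging: build one merged model from several checkpoints.

    Assumes all models share the same architecture; buffers (e.g., batch-norm
    running statistics) are left untouched in this sketch.
    """
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(0))              # uniform average per parameter
    return merged

def average_features(models, x):
    """Feature/output averaging (ensembling): average the models' predictions instead."""
    return torch.stack([m(x) for m in models]).mean(0)
```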
Abstract: In the complex realm of cognitive disorders, Alzheimer's disease (AD) and vascular dementia (VaD) are the two most prevalent dementia types, presenting entangled symptoms yet requiring distinct treatment approaches. The crux of effective treatment in slowing neurodegeneration lies in early, accurate diagnosis, as this significantly assists doctors in determining the appropriate course of action. However, current diagnostic practices often delay VaD diagnosis, impeding timely intervention and adversely affecting patient prognosis. This paper presents an innovative multi-omics approach to accurately differentiate AD from VaD, achieving a diagnostic accuracy of 89.25%. The proposed method segments longitudinal MRI scans and extracts advanced radiomics features. It then synergistically integrates the radiomics features with an ensemble of clinical, cognitive, and genetic data to provide state-of-the-art diagnostic accuracy, setting a new benchmark in classification accuracy on a large public dataset. The paper's primary contribution is a comprehensive methodology that utilizes multi-omics data to provide a nuanced understanding of dementia subtypes. Additionally, the paper introduces an interpretable model to enhance clinical decision-making, coupled with a novel model architecture for evaluating treatment efficacy. These advancements lay the groundwork for future work aimed not only at improving differential diagnosis but also at mitigating and preventing the progression of dementia.
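A late-fusion sketch of the multi-omics integration described above might look like the following, where extracted radiomics features are concatenated with clinical, cognitive, and genetic features and fed to a boosted-tree ensemble; the concatenation and the GradientBoostingClassifier are hypothetical stand-ins for the paper's actual ensemble and interpretable model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def fit_multiomics_classifier(radiomics, clinical, cognitive, genetic, labels):
    """Late-fusion sketch: concatenate feature blocks and fit a boosted ensemble.

    Each feature argument is an (n_subjects, n_features) array; labels encode
    AD vs. VaD. Illustrative only; not the paper's actual pipeline.
    """
    X = np.concatenate([radiomics, clinical, cognitive, genetic], axis=1)
    clf = GradientBoostingClassifier()
    scores = cross_val_score(clf, X, labels, cv=5)    # cross-validated accuracy
    return clf.fit(X, labels), scores.mean()
```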
Abstract: One of the most common defense strategies against model poisoning in federated learning is to employ a robust aggregation mechanism that makes training more resilient. Many of the existing Byzantine-robust aggregators provide theoretical guarantees and are empirically effective against certain categories of attacks. However, we observe that certain high-strength attacks can subvert the aggregator and collapse training. In addition, most aggregators require identifying suitable tolerance settings to converge. The impact of attacks becomes more pronounced when the number of Byzantine clients approaches a majority, and attacks become harder to evade when the attacker is omniscient, with access to the data, honest updates, and the aggregation method. Motivated by these observations, we develop a robust aggregator called FedRISE for cross-silo FL that is consistent and less susceptible to poisoned updates from an omniscient attacker. The proposed method explicitly determines the optimal direction of each gradient through a sign-voting strategy that uses variance-reduced sparse gradients. We argue that vote weighting based on the cosine similarity of raw gradients is misleading, and we introduce a sign-based gradient valuation function that ignores gradient magnitude. We compare our method against 8 robust aggregators under 6 poisoning attacks on 3 datasets and architectures. Our results show that existing robust aggregators collapse under at least some attacks in severe settings, while FedRISE demonstrates better robustness because of its stringent gradient-inclusion formulation.
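The sign-voting idea can be illustrated with the minimal aggregator below, which takes a coordinate-wise majority vote over the signs of client gradients and ignores their magnitudes; the median-magnitude rescaling is an illustrative choice, and FedRISE's actual formulation additionally uses variance-reduced sparse gradients and a stricter gradient-inclusion rule.

```python
import torch

def sign_vote_aggregate(client_grads, step_size=None) -> torch.Tensor:
    """Coordinate-wise sign voting that ignores gradient magnitudes.

    client_grads: list of flattened gradient tensors, one per client.
    Simplified sketch; not FedRISE's exact formulation.
    """
    stacked = torch.stack(client_grads)                # (num_clients, num_params)
    vote = torch.sign(torch.sign(stacked).sum(0))      # majority sign per coordinate
    # Rescale by the per-coordinate median magnitude unless a fixed step is given.
    scale = step_size if step_size is not None else stacked.abs().median(0).values
    return vote * scale
```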
Abstract: High-resolution images are preferable in the medical imaging domain as they significantly improve the diagnostic capability of the underlying method. In particular, high resolution helps substantially in improving automatic image segmentation. However, most existing deep learning-based techniques for medical image segmentation are optimized for input images with small spatial dimensions and perform poorly on high-resolution images. To address this shortcoming, we propose a parallel-in-branch architecture called TransResNet, which incorporates a Transformer and a CNN in parallel to extract features from multi-resolution images independently. In TransResNet, we introduce the Cross Grafting Module (CGM), which generates grafted features, enriched in both global semantic and low-level spatial details, by combining the feature maps from the Transformer and CNN branches through fusion and self-attention mechanisms. Moreover, we use these grafted features in the decoding process, increasing the information flow for better prediction of the segmentation mask. Extensive experiments on ten datasets demonstrate that TransResNet achieves either state-of-the-art or competitive results on several segmentation tasks, including skin lesion, retinal vessel, and polyp segmentation. The source code and pre-trained models are available at https://github.com/Sharifmhamza/TransResNet.
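One plausible reading of the Cross Grafting Module is sketched below: CNN and Transformer feature maps are concatenated, projected, and refined with self-attention. The channel sizes, the 1x1 fusion convolution, and the attention layout are assumptions, since the abstract only states that fusion and self-attention are combined.

```python
import torch
import torch.nn as nn

class CrossGraftingModule(nn.Module):
    """Fuse CNN and Transformer feature maps, then refine with self-attention."""

    def __init__(self, cnn_dim: int = 256, trans_dim: int = 256, heads: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(cnn_dim + trans_dim, cnn_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(cnn_dim, heads, batch_first=True)

    def forward(self, cnn_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
        # cnn_feat, trans_feat: (B, C, H, W) maps from the two parallel branches
        fused = self.fuse(torch.cat([cnn_feat, trans_feat], dim=1))
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)       # (B, H*W, C)
        grafted, _ = self.attn(tokens, tokens, tokens)  # self-attention over locations
        return grafted.transpose(1, 2).reshape(b, c, h, w)
```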
Abstract: Recent attention-based volumetric segmentation (VS) methods have achieved remarkable performance in the medical domain by modeling long-range dependencies. However, for voxel-wise prediction tasks, discriminative local features are key to the performance of VS models, and these are missing in attention-based VS methods. To resolve this issue, we deliberately incorporate a convolutional encoder branch alongside a transformer backbone to extract local and global features in parallel and aggregate them in a Cross Feature Mixer Module (CFMM) for better prediction of the segmentation mask. Consequently, we observe that the derived model, Y-CT-Net, achieves competitive performance on multiple medical segmentation tasks. For example, on multi-organ segmentation, Y-CT-Net achieves an 82.4% Dice score, surpassing the well-tuned Transformer/CNN-like VS baselines UNETR/ResNet-3D by 2.9%/1.4%. Building on the success of Y-CT-Net, we extend this concept to hybrid attention models, deriving the Y-CH-Net model, which brings a 3% improvement in HD95 score for the same segmentation task. The effectiveness of both Y-CT-Net and Y-CH-Net verifies our hypothesis and motivates us to introduce the concept of Y-CA-Net, a versatile, generic architecture built from any two encoder backbones and a decoder, to fully exploit the complementary strengths of convolution and attention mechanisms. Based on experimental results, we argue that Y-CA-Net is a key player in achieving superior results for volumetric segmentation.
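A simple way to picture the Cross Feature Mixer Module is the sketch below, which fuses voxel-level features from the convolutional and attention encoders with a 1x1x1 convolution before decoding; the concatenate-and-project design, channel sizes, and normalization are illustrative guesses, since the abstract does not specify the CFMM's internals.

```python
import torch
import torch.nn as nn

class CrossFeatureMixer(nn.Module):
    """Mix local (CNN) and global (Transformer) 3D features before decoding."""

    def __init__(self, conv_dim: int = 96, attn_dim: int = 96, out_dim: int = 96):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv3d(conv_dim + attn_dim, out_dim, kernel_size=1),  # 1x1x1 projection
            nn.InstanceNorm3d(out_dim),
            nn.GELU(),
        )

    def forward(self, conv_feat: torch.Tensor, attn_feat: torch.Tensor) -> torch.Tensor:
        # conv_feat, attn_feat: (B, C, D, H, W) voxel features from the two encoders
        return self.mix(torch.cat([conv_feat, attn_feat], dim=1))
```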