Abstract:Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
Abstract:Accurately labeling biomedical data presents a challenge. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method employing inductive conformal prediction (ICP). This method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers within vast quantities of noisy training data. The efficacy of the method is validated across three classification tasks within distinct modalities: filtering drug-induced-liver-injury (DILI) literature with title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise to the training labels were introduced through label permutation. Results show significant enhancements in classification performance: accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4%), AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% and 69.8%), and accuracy and macro-average F1 score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% and 89.0%). Our method offers the potential to substantially boost classification performance in multi-modal biomedical machine learning tasks. Importantly, it accomplishes this without necessitating an excessive volume of meticulously curated training data.