Abstract:Complex cell signaling systems -- governed by varying protein abundances and interactions -- generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, hundreds of millions of single-cell omics data have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations -- such as biological functions, cellular locations, signaling pathways, related diseases, and drugs -- with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
Abstract:Biomarker identification is critical for precise disease diagnosis and understanding disease pathogenesis in omics data analysis, like using fold change and regression analysis. Graph neural networks (GNNs) have been the dominant deep learning model for analyzing graph-structured data. However, we found two major limitations of existing GNNs in omics data analysis, i.e., limited-prediction (diagnosis) accuracy and limited-reproducible biomarker identification capacity across multiple datasets. The root of the challenges is the unique graph structure of biological signaling pathways, which consists of a large number of targets and intensive and complex signaling interactions among these targets. To resolve these two challenges, in this study, we presented a novel GNN model architecture, named PathFormer, which systematically integrate signaling network, priori knowledge and omics data to rank biomarkers and predict disease diagnosis. In the comparison results, PathFormer outperformed existing GNN models significantly in terms of highly accurate prediction capability ( 30% accuracy improvement in disease diagnosis compared with existing GNN models) and high reproducibility of biomarker ranking across different datasets. The improvement was confirmed using two independent Alzheimer's Disease (AD) and cancer transcriptomic datasets. The PathFormer model can be directly applied to other omics data analysis studies.
Abstract:Effective gene network representation learning is of great importance in bioinformatics to predict/understand the relation of gene profiles and disease phenotypes. Though graph neural networks (GNNs) have been the dominant architecture for analyzing various graph-structured data like social networks, their predicting on gene networks often exhibits subpar performance. In this paper, we formally investigate the gene network representation learning problem and characterize a notion of \textit{universal graph normalization}, where graph normalization can be applied in an universal manner to maximize the expressive power of GNNs while maintaining the stability. We propose a novel UNGNN (Universal Normalized GNN) framework, which leverages universal graph normalization in both the message passing phase and readout layer to enhance the performance of a base GNN. UNGNN has a plug-and-play property and can be combined with any GNN backbone in practice. A comprehensive set of experiments on gene-network-based bioinformatical tasks demonstrates that our UNGNN model significantly outperforms popular GNN benchmarks and provides an overall performance improvement of 16 $\%$ on average compared to previous state-of-the-art (SOTA) baselines. Furthermore, we also evaluate our theoretical findings on other graph datasets where the universal graph normalization is solvable, and we observe that UNGNN consistently achieves the superior performance.