Abstract:Recently, topological data analysis (TDA) has become a trending topic in data science and engineering. However, the key technique of TDA, i.e., persistent homology, is defined on point cloud data, which restricts its scope. In this work, we propose the persistent de Rham-Hodge Laplacian, or persistent Hodge Laplacian (PHL) for short, for TDA on manifolds with boundaries, or volumetric data. Specifically, we extend the evolutionary de Rham-Hodge theory from the Lagrangian formulation to the Eulerian formulation via structure-preserving Cartesian grids, and extend the persistent Laplacian on point clouds to the persistent (de Rham-)Hodge Laplacian on nested families of manifolds with appropriate boundary conditions. The proposed PHL facilitates machine learning and deep learning prediction of volumetric data. As a proof-of-principle application, we propose a persistent Hodge Laplacian learning (PHLL) algorithm for data on manifolds or volumetric data. To this end, we showcase the PHLL prediction of protein-ligand binding affinities in two benchmark datasets. Our numerical experiments highlight the power and promise of PHLL.
Abstract:Data sets with imbalanced class sizes, where one class is often much smaller than the others, occur frequently in various applications, including those with biological foundations, such as drug discovery and disease diagnosis. It is therefore important to be able to identify data elements of classes of various sizes, as a failure to detect them can result in heavy costs. However, many data classification algorithms do not perform well on imbalanced data sets, as they often fail to detect elements belonging to underrepresented classes. In this paper, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) techniques and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification problems on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed method not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer model based on an attention mechanism for self-supervised learning. Additionally, the method implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed model is validated using six molecular data sets, and we also provide a thorough comparison to other competing algorithms. The computational experiments show that the proposed method performs better than competing techniques even when the class imbalance ratio is very high.
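As an illustration of using distance correlation as a graph weight function, the following minimal sketch computes the standard sample distance correlation between two feature vectors and fills a similarity matrix with it. Treating each molecule's feature vector as a sample of scalar observations is an assumption made here for illustration, and the helper names (`distance_correlation`, `similarity_graph`) are hypothetical; this is not the BTDT-MBO implementation.

```python
import numpy as np

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D array."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D arrays (value in [0, 1])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2_xy = (A * B).mean()   # squared sample distance covariance
    dvar2_x = (A * A).mean()    # squared sample distance variance of x
    dvar2_y = (B * B).mean()    # squared sample distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0              # a constant vector carries no dependence information
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))

def similarity_graph(features):
    """Symmetric weight matrix with W[i, j] = dCor(features[i], features[j]) (assumed usage)."""
    n = len(features)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = distance_correlation(features[i], features[j])
    return W
```

Unlike Pearson correlation, distance correlation also responds to nonlinear dependence, which is one plausible reason to prefer it as an edge weight.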
Abstract:Topological deep learning (TDL) is a rapidly evolving field that uses topological features to understand and design deep learning models. This paper posits that TDL may complement graph representation learning and geometric deep learning by incorporating topological concepts, and can thus provide a natural choice for various machine learning settings. To this end, this paper discusses open problems in TDL, ranging from practical benefits to theoretical foundations. For each problem, it outlines potential solutions and future research opportunities. At the same time, this paper serves as an invitation to the scientific community to actively participate in TDL research to unlock the potential of this emerging field.
Abstract:The escalating drug addiction crisis in the United States underscores the urgent need for innovative therapeutic strategies. This study pursues a rigorous strategy to uncover potential drug repurposing candidates for opioid and cocaine addiction treatment, bridging the gap between transcriptomic data analysis and drug discovery. We begin by conducting differential gene expression analysis on addiction-related transcriptomic data to obtain differentially expressed genes (DEGs). We then propose a novel topological differentiation to identify key genes from a protein-protein interaction (PPI) network derived from the DEGs. This method utilizes persistent Laplacians to single out pivotal nodes within the network, conducting the analysis in a multiscale manner to ensure high reliability. Through rigorous literature validation, pathway analysis, and data-availability scrutiny, we identified three pivotal molecular targets, mTOR, mGluR5, and NMDAR, for drug repurposing from DrugBank. We built machine learning models employing two natural language processing (NLP)-based embeddings and a traditional 2D fingerprint, which demonstrated robust predictive ability in gauging binding affinities of DrugBank compounds to the selected targets. Furthermore, we elucidated the interactions of promising drugs with the targets and evaluated their drug-likeness. This study delineates a multi-faceted and comprehensive analytical framework, combining bioinformatics, topological data analysis, and machine learning, for drug repurposing in addiction treatment, setting the stage for subsequent experimental validation. The versatility of the methods we developed allows for applications across a range of diseases and transcriptomic datasets.
Abstract:Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach due to its meta-gene interpretation of resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF for the visualization of popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).
Abstract:Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L$_{2,1}$ norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration by varying a distance threshold, we introduce kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Non-Negative Matrix Factorization (NMF), by significant margins.
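The idea of a filtration indexed by the number of neighbors, rather than by a distance threshold, can be sketched as follows. This is a minimal illustration under stated assumptions: it builds unweighted, symmetrized kNN graphs for increasing k and combines their graph Laplacians in a weighted sum. The function names, the symmetrization rule, and the weighting scheme are hypothetical choices, not the kNN-PL construction of the paper.

```python
import numpy as np

def knn_laplacian(X, k):
    """Graph Laplacian of the symmetrized k-nearest-neighbor graph of the rows of X."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is never its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbors per row
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    A = np.maximum(A, A.T)                    # keep an edge if either endpoint selects it
    return np.diag(A.sum(axis=1)) - A

def knn_filtration_laplacian(X, ks, weights):
    """Weighted sum of kNN Laplacians over a sequence of neighbor counts (assumed scheme)."""
    return sum(w * knn_laplacian(X, k) for k, w in zip(ks, weights))

# Usage: 30 points in 5 dimensions, three scales k = 2, 4, 8 with equal weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
L_multi = knn_filtration_laplacian(X, ks=[2, 4, 8], weights=[1.0, 1.0, 1.0])
```

Because the k nearest neighbors of a point are always among its k+1 nearest neighbors, the graphs are nested as k grows, which is what makes the sequence a filtration. The only scale parameters are integers, in contrast to a continuous distance cutoff.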
Abstract:ChatGPT represents a significant milestone in the field of artificial intelligence (AI), finding widespread applications across diverse domains. However, its effectiveness in mathematical contexts has been somewhat constrained by its susceptibility to conceptual errors. Concurrently, topological data analysis (TDA), a relatively new discipline, has garnered substantial interest in recent years. Nonetheless, the advancement of TDA is impeded by the limited understanding of computational algorithms and coding proficiency among theoreticians. This work endeavors to bridge the gap between theoretical topological concepts and their practical implementation in computational topology through the utilization of ChatGPT. We showcase how a pure theoretician, devoid of computational experience and coding skills, can effectively transform mathematical formulations and concepts into functional code for computational topology with the assistance of ChatGPT. Our strategy outlines a productive process wherein a mathematician trains ChatGPT on pure mathematical concepts, steers ChatGPT towards generating computational topology code, and subsequently validates the generated code using established examples. Our specific case studies encompass the computation of Betti numbers, Laplacian matrices, and Dirac matrices for simplicial complexes, as well as the persistence of various homologies and Laplacians. Furthermore, we explore the application of ChatGPT in computing recently developed topological theories for hypergraphs and digraphs. This work serves as an initial step towards effectively transforming pure mathematical theories into practical computational tools, with the ultimate goal of enabling real applications across diverse fields.
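For readers unfamiliar with the kind of code involved, the computation of Betti numbers and combinatorial Laplacians for a small simplicial complex can be sketched as below. The example complex (a triangle on vertices 0, 1, 2, hollow or filled) and the function names are chosen here for illustration and do not come from the paper; ranks are taken over the reals with NumPy.

```python
import numpy as np

# Boundary matrix B1: rows are vertices 0, 1, 2; columns are the oriented
# edges [0,1], [0,2], [1,2]. Each column is (head vertex) - (tail vertex).
B1 = np.array([
    [-1.0, -1.0,  0.0],
    [ 1.0,  0.0, -1.0],
    [ 0.0,  1.0,  1.0],
])

# Boundary matrix B2 of the single 2-simplex [0,1,2]:
# boundary = [1,2] - [0,2] + [0,1], written in the edge order above.
B2 = np.array([[1.0], [-1.0], [1.0]])

def betti_numbers(B1, B2=None):
    """Return (beta_0, beta_1) from boundary matrices via rank-nullity."""
    n_vertices, n_edges = B1.shape
    rank1 = np.linalg.matrix_rank(B1)
    rank2 = 0 if B2 is None else np.linalg.matrix_rank(B2)
    beta0 = n_vertices - rank1            # number of connected components
    beta1 = n_edges - rank1 - rank2       # number of independent 1-dimensional holes
    return int(beta0), int(beta1)

def hodge_laplacians(B1, B2=None):
    """Return the combinatorial Laplacians (L0, L1)."""
    L0 = B1 @ B1.T
    L1 = B1.T @ B1
    if B2 is not None:
        L1 = L1 + B2 @ B2.T
    return L0, L1

hollow = betti_numbers(B1)        # triangle without its interior
filled = betti_numbers(B1, B2)    # triangle with its interior
```

The hollow triangle has one component and one loop, while filling in the 2-simplex removes the loop. The dimension of the kernel of each Laplacian equals the corresponding Betti number, which gives an independent check on generated code.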
Abstract:The birth of ChatGPT, a cutting-edge language model chatbot developed by OpenAI, ushered in a new era in AI, and this paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.
Abstract:Pain is a significant global health issue, and the current treatment options for pain management have limitations in terms of effectiveness, side effects, and potential for addiction. There is a pressing need for improved pain treatments and the development of new drugs. Voltage-gated sodium channels, particularly Nav1.3, Nav1.7, Nav1.8, and Nav1.9, play a crucial role in neuronal excitability and are predominantly expressed in the peripheral nervous system. Targeting these channels may provide a means to treat pain while minimizing central and cardiac adverse effects. In this study, we construct protein-protein interaction (PPI) networks based on pain-related sodium channels and develop a corresponding drug-target interaction (DTI) network to identify potential lead compounds for pain management. To ensure reliable machine learning predictions, we carefully select 111 inhibitor datasets from a pool of over 1,000 targets in the PPI network. We employ three distinct machine learning algorithms combined with advanced natural language processing (NLP)-based embeddings, specifically pre-trained transformer and autoencoder representations. Through a systematic screening process, we evaluate the side effects and repurposing potential of over 150,000 drug candidates targeting Nav1.7 and Nav1.8 sodium channels. Additionally, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of these candidates to identify leads with near-optimal characteristics. Our strategy provides an innovative platform for the pharmacological development of pain treatments, offering the potential for improved efficacy and reduced side effects.
Abstract:Opioid Use Disorder (OUD) has emerged as a significant global public health issue, with complex multifaceted conditions. Due to the lack of effective treatment options for various conditions, there is a pressing need for the discovery of new medications. In this study, we propose a deep generative model that combines stochastic differential equation (SDE)-based diffusion modeling with the latent space of a pretrained autoencoder model. The molecular generator enables efficient generation of molecules that are effective on multiple targets, specifically the mu, kappa, and delta opioid receptors. Furthermore, we assess the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of the generated molecules to identify drug-like compounds. To enhance the pharmacokinetic properties of some lead compounds, we employ a molecular optimization approach, obtaining a diverse set of drug-like molecules. We construct binding affinity predictors by integrating molecular fingerprints derived from autoencoder embeddings, transformer embeddings, and topological Laplacians with advanced machine learning algorithms. Further experimental studies are needed to evaluate the pharmacological effects of these drug-like compounds for OUD treatment. Our machine learning platform serves as a valuable tool in designing and optimizing effective molecules for addressing OUD.