Abstract:We develop an algorithm that finds the consensus of many different clustering solutions of a graph. We formulate the problem as a median set partitioning problem and propose a greedy optimization technique. Unlike other approaches that find median set partitions, our algorithm takes graph structure into account and finds a comparable quality solution much faster than the other approaches. For graphs with known communities, our consensus partition captures the actual community structure more accurately than alternative approaches. To make it applicable to large graphs, we remove sequential dependencies from our algorithm and design a parallel algorithm. Our parallel algorithm achieves 35x speedup when utilizing 64 processing cores for large real-world graphs from single-cell experiments.
Abstract:Core computations in Graph Neural Network (GNN) training and inference are often mapped to sparse matrix operations such as sparse-dense matrix multiplication (SpMM). These sparse operations are harder to optimize by manual tuning because their performance depends significantly on the sparsity of input graphs, GNN models, and computing platforms. To address this challenge, we present iSpLib, a PyTorch-based C++ library equipped with auto-tuned sparse operations. iSpLib expedites GNN training with a cache-enabled backpropagation that stores intermediate matrices in local caches. The library offers a user-friendly Python plug-in that allows users to take advantage of our optimized PyTorch operations out-of-the-box for any existing linear algebra-based PyTorch implementation of popular GNNs (Graph Convolution Network, GraphSAGE, Graph Inference Network, etc.) with only two lines of additional code. We demonstrate that iSpLib obtains up to 27x overall training speedup compared to the equivalent PyTorch 2.1.0 and PyTorch Geometric 2.4.0 implementations on the CPU. Our library is publicly available at https://github.com/HipGraph/iSpLib (https://doi.org/10.5281/zenodo.10806511).
Abstract:Graph neural networks (GNNs) are popular machine learning models for graphs with many applications across scientific domains. However, GNNs are considered black box models, and it is challenging to understand how the model makes predictions. Game theory-based Shapley value approaches are popular explanation methods in other domains but are not well-studied for graphs. Some studies have proposed Shapley value-based GNN explanations, yet they have several limitations: they consider limited samples to approximate Shapley values; some mainly focus on small and large coalition sizes, and they are an order of magnitude slower than other explanation methods, making them inapplicable to even moderate-size graphs. In this work, we propose GNNShap, which provides explanations for edges since they provide more natural explanations for graphs and more fine-grained explanations. We overcome the limitations by sampling from all coalition sizes, parallelizing the sampling on GPUs, and speeding up model predictions by batching. GNNShap gives better fidelity scores and faster explanations than baselines on real-world datasets.
Abstract:Graph Neural Networks (GNNs) are widely used to perform different machine learning tasks on graphs. As the size of the graphs grows, and the GNNs get deeper, training and inference time become costly in addition to the memory requirement. Thus, without sacrificing accuracy, graph sparsification, or model compression becomes a viable approach for graph learning tasks. A few existing techniques only study the sparsification of graphs and GNN models. In this paper, we develop a SparseGCN pipeline to study all possible sparsification in GNN. We provide a theoretical analysis and empirically show that it can add up to 11.6\% additional sparsity to the embedding matrix without sacrificing the accuracy of the commonly used benchmark graph datasets.
Abstract:We propose a unified framework for coordinating Unmanned Aerial Vehicle (UAV) swarms operating under time-varying communication networks. Our framework builds on the concept of graphical games, which we argue provides a compelling paradigm to subsume the interaction structures found in networked UAV swarms thanks to the shared local neighborhood properties. We present a general-sum, factorizable payoff function for cooperative UAV swarms based on the aggregated local states and yield a Nash equilibrium for the stage games. Further, we propose a decomposition-based approach to solve stage-graphical games in a scalable and decentralized fashion by approximating virtual, mean neighborhoods. Finally, we discuss extending the proposed framework toward general-sum stochastic games by leveraging deep Q-learning and model-predictive control.
Abstract:Most real-world networks contain well-defined community structures where nodes are densely connected internally within communities. To learn from these networks, we develop MarkovGNN that captures the formation and evolution of communities directly in different convolutional layers. Unlike most Graph Neural Networks (GNNs) that consider a static graph at every layer, MarkovGNN generates different stochastic matrices using a Markov process and then uses these community-capturing matrices in different layers. MarkovGNN is a general approach that could be used with most existing GNNs. We experimentally show that MarkovGNN outperforms other GNNs for clustering, node classification, and visualization tasks. The source code of MarkovGNN is publicly available at \url{https://github.com/HipGraph/MarkovGNN}.
Abstract:Graph representation learning is a fast-growing field where one of the main objectives is to generate meaningful representations of graphs in lower-dimensional spaces. The learned embeddings have been successfully applied to perform various prediction tasks, such as link prediction, node classification, clustering, and visualization. The collective effort of the graph learning community has delivered hundreds of methods, but no single method excels under all evaluation metrics such as prediction accuracy, running time, scalability, etc. This survey aims to evaluate all major classes of graph embedding methods by considering algorithmic variations, parameter selections, scalability, hardware and software platforms, downstream ML tasks, and diverse datasets. We organized graph embedding techniques using a taxonomy that includes methods from manual feature engineering, matrix factorization, shallow neural networks, and deep graph convolutional networks. We evaluated these classes of algorithms for node classification, link prediction, clustering, and visualization tasks using widely used benchmark graphs. We designed our experiments on top of PyTorch Geometric and DGL libraries and run experiments on different multicore CPU and GPU platforms. We rigorously scrutinize the performance of embedding methods under various performance metrics and summarize the results. Thus, this paper may serve as a comparative guide to help users select methods that are most suitable for their tasks.
Abstract:We present a machine learning method to predict extreme hydrologic events from spatially and temporally varying hydrological and meteorological data. We used a timestep reduction technique to reduce the computational and memory requirements and trained a bidirection LSTM network to predict soil water and stream flow from time series data observed and simulated over eighty years in the Wabash River Watershed. We show that our simple model can be trained much faster than complex attention networks such as GeoMAN without sacrificing accuracy. Based on the predicted values of soil water and stream flow, we predict the occurrence and severity of extreme hydrologic events such as droughts. We also demonstrate that extreme events can be predicted in geographical locations separate from locations observed during the training process. This spatially-inductive setting enables us to predict extreme events in other areas in the US and other parts of the world using our model trained with the Wabash Basin data.
Abstract:Electronic Health Record (EHR) data has been of tremendous utility in Artificial Intelligence (AI) for healthcare such as predicting future clinical events. These tasks, however, often come with many challenges when using classical machine learning models due to a myriad of factors including class imbalance and data heterogeneity (i.e., the complex intra-class variances). To address some of these research gaps, this paper leverages the exciting contrastive learning framework and proposes a novel contrastive regularized clinical classification model. The contrastive loss is found to substantially augment EHR-based prediction: it effectively characterizes the similar/dissimilar patterns (by its "push-and-pull" form), meanwhile mitigating the highly skewed class distribution by learning more balanced feature spaces (as also echoed by recent findings). In particular, when naively exporting the contrastive learning to the EHR data, one hurdle is in generating positive samples, since EHR data is not as amendable to data augmentation as image data. To this end, we have introduced two unique positive sampling strategies specifically tailored for EHR data: a feature-based positive sampling that exploits the feature space neighborhood structure to reinforce the feature learning; and an attribute-based positive sampling that incorporates pre-generated patient similarity metrics to define the sample proximity. Both sampling approaches are designed with an awareness of unique high intra-class variance in EHR data. Our overall framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data with a total of 5,712 patients admitted to a large, urban health system. Specifically, our method reaches a high AUROC prediction score of 0.959, which outperforms other baselines and alternatives: cross-entropy(0.873) and focal loss(0.931).
Abstract:Computational prediction of in-hospital mortality in the setting of an intensive care unit can help clinical practitioners to guide care and make early decisions for interventions. As clinical data are complex and varied in their structure and components, continued innovation of modeling strategies is required to identify architectures that can best model outcomes. In this work, we train a Heterogeneous Graph Model (HGM) on Electronic Health Record data and use the resulting embedding vector as additional information added to a Convolutional Neural Network (CNN) model for predicting in-hospital mortality. We show that the additional information provided by including time as a vector in the embedding captures the relationships between medical concepts, lab tests, and diagnoses, which enhances predictive performance. We find that adding HGM to a CNN model increases the mortality prediction accuracy up to 4\%. This framework serves as a foundation for future experiments involving different EHR data types on important healthcare prediction tasks.