Abstract:Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than that of ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet the extent to which preconditioning can enhance the generalization performance of SGD, and whether it can bridge the existing gap with ridge regression, remains uncertain. In this paper, we study the generalization performance of SGD with preconditioning for the least squares problem. We make a comprehensive comparison between preconditioned SGD and (standard and preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditioning matrix. Second, leveraging the excess risk characterization of preconditioned SGD and ridge regression, we show by construction that there exists a simple preconditioning matrix with which preconditioned SGD can outperform (standard and preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is simple enough to allow robust estimation from finite samples while maintaining a theoretical advantage over ridge regression. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.
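To make the object of study concrete, the following is a minimal sketch of one pass of preconditioned SGD on a least squares problem. It is an illustration under stated assumptions, not the paper's exact algorithm: the function name `preconditioned_sgd`, the constant step size `lr`, the zero initialization, and returning the last iterate (rather than, say, an averaged iterate) are all choices made here for illustration. The matrix `G` stands for an arbitrary preconditioning matrix; the abstract does not specify the proposed construction, so none is assumed.

```python
import numpy as np


def preconditioned_sgd(X, y, G, lr, w0=None):
    """One pass of SGD on the squared loss, with each gradient multiplied by G.

    X  : (n, d) array of inputs, visited once in order
    y  : (n,) array of targets
    G  : (d, d) preconditioning matrix (G = identity recovers plain SGD)
    lr : constant step size (an assumption made for this sketch)
    """
    n, d = X.shape
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    for x_t, y_t in zip(X, y):
        # Gradient of 0.5 * (x_t . w - y_t)^2 with respect to w.
        grad = (x_t @ w - y_t) * x_t
        # The preconditioner rescales and rotates the update direction.
        w = w - lr * (G @ grad)
    return w
```

With `G = np.eye(d)` this reduces to plain single-pass SGD, which gives a baseline against which a non-trivial preconditioner can be compared.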
Abstract:Many computer vision and machine learning problems are modelled as learning tasks on heterogeneous graphs, featuring a wide array of relations from diverse types of nodes and edges. Heterogeneous graph neural networks (HGNNs) stand out as a promising neural model class designed for heterogeneous graphs. Built on traditional GNNs, existing HGNNs employ different parameter spaces to model the varied relationships. However, the practical effectiveness of existing HGNNs is often limited to simple heterogeneous graphs with few relation types. This paper first highlights and demonstrates that the standard approach employed by existing HGNNs inevitably leads to parameter explosion and relation collapse, making HGNNs less effective or impractical for complex heterogeneous graphs with numerous relation types. To overcome this issue, we introduce a novel framework, Blend&Grind-HGNN (BG-HGNN), which effectively tackles the challenges by carefully integrating different relations into a unified feature space manageable by a single set of parameters. This results in a refined HGNN method that is more efficient and effective in learning from heterogeneous graphs, especially when the number of relations grows. Our empirical studies illustrate that BG-HGNN significantly surpasses existing HGNNs in terms of parameter efficiency (up to 28.96 $\times$), training throughput (up to 8.12 $\times$), and accuracy (up to 1.07 $\times$).
Abstract:Many computer vision and machine learning problems are modelled as learning tasks on graphs, where graph neural networks (GNNs) have emerged as a dominant tool for learning representations of graph-structured data. A key feature of GNNs is their use of graph structures as input, enabling them to exploit the graphs' inherent topological properties, known as the topology awareness of GNNs. Despite the empirical successes of GNNs, the influence of topology awareness on generalization performance remains unexplored, particularly for node-level tasks that diverge from the assumption of data being independent and identically distributed (I.I.D.). The precise definition and characterization of the topology awareness of GNNs, especially concerning different topological features, are still unclear. This paper introduces a comprehensive framework to characterize the topology awareness of GNNs across any topological feature. Using this framework, we investigate the effects of topology awareness on GNN generalization performance. Contrary to the prevailing belief that enhancing the topology awareness of GNNs is always advantageous, our analysis reveals a critical insight: improving the topology awareness of GNNs may inadvertently lead to unfair generalization across structural groups, which might not be desired in some scenarios. Additionally, we conduct a case study using an intrinsic graph metric, the shortest path distance, on various benchmark datasets. The empirical results of this case study confirm our theoretical insights. Moreover, we demonstrate the practical applicability of our framework by using it to tackle the cold start problem in graph active learning.
Abstract:Memory-based Temporal Graph Neural Networks (MTGNNs) are a class of temporal graph neural networks that utilize a node memory module to capture and retain long-term temporal dependencies, leading to superior performance compared to memory-less counterparts. However, the iterative reading and updating process of the memory module in MTGNNs to obtain up-to-date information needs to follow the temporal dependencies. This introduces significant overhead and limits training throughput. Existing optimizations for static GNNs are not directly applicable to MTGNNs due to differences in training paradigm, model architecture, and the absence of a memory module. Moreover, they do not effectively address the challenges posed by temporal dependencies, making them ineffective for MTGNN training. In this paper, we propose MSPipe, a general and efficient framework for MTGNNs that maximizes training throughput while maintaining model accuracy. Our design addresses the unique challenges associated with fetching and updating node memory states in MTGNNs by integrating staleness into the memory module. However, simply introducing a predefined staleness bound in the memory module to break temporal dependencies may lead to suboptimal performance and lack of generalizability across different models and datasets. To solve this, we introduce an online pipeline scheduling algorithm in MSPipe that strategically breaks temporal dependencies with minimal staleness and delays memory fetching to obtain fresher memory states. Moreover, we design a staleness mitigation mechanism to enhance training convergence and model accuracy. We provide convergence analysis and prove that MSPipe maintains the same convergence rate as vanilla sample-based GNN training. Experimental results show that MSPipe achieves up to 2.45x speed-up without sacrificing accuracy, making it a promising solution for efficient MTGNN training.
Abstract:Incremental learning is a machine learning approach that involves training a model on a sequence of tasks, rather than all tasks at once. This ability to learn incrementally from a stream of tasks is crucial for many real-world applications. However, incremental learning is a challenging problem on graph-structured data, as many graph-related problems involve prediction tasks for each individual node, known as Node-wise Graph Incremental Learning (NGIL). This introduces non-independent and non-identically distributed characteristics in the sample data generation process, making it difficult to maintain the performance of the model as new tasks are added. In this paper, we focus on the inductive NGIL problem, which accounts for the evolution of graph structure (structural shift) induced by emerging tasks. We provide a formal formulation and analysis of the problem, and propose a novel regularization-based technique called Structural-Shift-Risk-Mitigation (SSRM) to mitigate the impact of the structural shift on catastrophic forgetting in the inductive NGIL problem. We show that the structural shift can lead to a shift in the input distribution for the existing tasks, which in turn increases the risk of catastrophic forgetting. Through comprehensive empirical studies with several benchmark datasets, we demonstrate that SSRM is flexible and easy to adapt to improve the performance of state-of-the-art GNN incremental learning frameworks in the inductive setting.
Abstract:In this study, we explore the synergy of deep learning and financial market applications, focusing on pair trading. This market-neutral strategy is integral to quantitative finance and is apt for advanced deep-learning techniques. A pivotal challenge in pair trading is discerning temporal correlations among entities, necessitating the integration of diverse data modalities. Addressing this, we introduce a novel framework, Multi-modal Temporal Relation Graph Learning (MTRGL). MTRGL combines time series data and discrete features into a temporal graph and employs a memory-based temporal graph neural network. This approach reframes temporal correlation identification as a temporal graph link prediction task, which has shown empirical success. Our experiments on real-world datasets confirm the superior performance of MTRGL, emphasizing its promise in refining automated pair trading strategies.
Abstract:Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts. However, training MDGNNs faces the challenge of handling entangled temporal and structural dependencies, requiring sequential and chronological processing of data sequences to capture accurate temporal patterns. During batch training, the temporal data points within the same batch are processed in parallel, while their temporal dependencies are neglected. This issue is referred to as temporal discontinuity and restricts the effective temporal batch size, limiting data parallelism and reducing MDGNNs' flexibility in industrial applications. This paper studies the efficient training of MDGNNs at scale, focusing on the temporal discontinuity in training MDGNNs with large temporal batch sizes. We first conduct a theoretical study on the impact of temporal batch size on the convergence of MDGNN training. Based on the analysis, we propose PRES, an iterative prediction-correction scheme combined with a memory coherence learning objective to mitigate the effect of temporal discontinuity, enabling MDGNNs to be trained with significantly larger temporal batches without sacrificing generalization performance. Experimental results demonstrate that our approach enables up to a 4x larger temporal batch (3.4x speed-up) during MDGNN training.
Abstract:Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Because the large space of DNN models and devices impedes direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaptation-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, so that the predictor can learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
Abstract:Inductive node-wise graph incremental learning is a challenging task due to the dynamic nature of evolving graphs and the dependencies between nodes. In this paper, we propose a novel experience replay framework, called Structure-Evolution-Aware Experience Replay (SEA-ER), that addresses these challenges by leveraging the topological awareness of GNNs and an importance reweighting technique. Our framework handles the data dependency of node prediction problems in evolving graphs, with a theoretical guarantee that supports its effectiveness. Through empirical evaluation, we demonstrate that our proposed framework outperforms the current state-of-the-art GNN experience replay methods on several benchmark datasets, as measured by metrics such as accuracy and forgetting.
Abstract:Graph Neural Networks (GNNs) are a family of neural models tailored for graph-structured data and have shown superior performance in learning representations of such data. However, training GNNs on large graphs remains challenging. A promising direction is distributed GNN training, which partitions the input graph and distributes the workload across multiple machines. The key bottleneck of existing distributed GNN training frameworks is the cross-machine communication induced by the dependencies in the graph data and the aggregation operator of GNNs. In this paper, we study the communication complexity of distributed GNN training and propose a simple lossless communication reduction method, termed the Aggregation before Communication (ABC) method. The ABC method exploits the permutation-invariant property of the GNN layer and leads to a paradigm in which vertex-cut is proved to admit superior communication performance compared with the currently popular paradigm (edge-cut). In addition, we show that the new partition paradigm is particularly well suited to dynamic graphs, where it is infeasible to control the edge placement due to the unknown stochasticity of the graph-changing process.
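The permutation-invariance idea behind aggregating before communicating can be illustrated with a small sketch. It assumes a sum aggregator and a hypothetical `owner` array mapping each node to the machine that stores it; the function names and the message-counting convention are illustrative and are not taken from the paper. Because a sum does not depend on the order or grouping of its terms, each machine can reduce its local neighbors to one partial vector, so the number of vectors exchanged depends on the number of machines rather than the number of neighbors, and the result is unchanged (lossless).

```python
import numpy as np


def full_aggregate(features, neighbor_ids):
    """Reference result: sum the feature vectors of all neighbors directly."""
    return features[neighbor_ids].sum(axis=0)


def aggregate_before_communication(features, neighbor_ids, owner):
    """Sum neighbors per machine first, then combine the partial sums.

    owner[i] is the (hypothetical) id of the machine storing node i.
    Returns the aggregate and the number of partial vectors produced,
    one per machine that holds at least one neighbor.
    """
    partials = {}
    for nid in neighbor_ids:
        machine = owner[nid]
        if machine not in partials:
            partials[machine] = np.zeros(features.shape[1])
        # Local reduction: no cross-machine traffic is needed for this step.
        partials[machine] += features[nid]
    total = np.zeros(features.shape[1])
    for partial in partials.values():
        # Only one vector per machine has to be combined here.
        total += partial
    return total, len(partials)
```

In this toy setting, five neighbors spread over two machines yield two partial vectors instead of five individual feature vectors, while the aggregate matches `full_aggregate` exactly.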