Abstract:Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited in local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities-temporal, textual, and structural-often exhibiting dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Last, we design a lightweight adaptor to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusive edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to 33.62% improvement against the baseline using four dynamic graph models.
Abstract:Dynamic graphs are formulated in continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. To advance representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs. However, it typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the structural evolution associated with each temporal granularity. In addition, existing works-whether specific or unified-often overlook the issue of temporal noise, compromising the model robustness and effectiveness. To better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which scales to large dynamic graphs. We first propose a novel Fourier Graph Attention (FGAT) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGAT is well-suited to capture the underlying structures in CTDGs and DTDGs. We further enhance FGAT to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to the energy. Last, we leverage our FGAT mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14.4% over sixteen baselines across nine dynamic graphs.
Abstract:Missing data imputation, which aims to impute the missing values in the raw datasets to achieve the completeness of datasets, is crucial for modern data-driven models like large language models (LLMs) and has attracted increasing interest over the past decades. Despite its importance, existing solutions for missing data imputation either 1) only support numerical and categorical data or 2) show an unsatisfactory performance due to their design prioritizing text data and the lack of key properties for tabular data imputation. In this paper, we propose UnIMP, a Unified IMPutation framework that leverages LLM and high-order message passing to enhance the imputation of mixed-type data including numerical, categorical, and text data. Specifically, we first introduce a cell-oriented hypergraph to model the table. We then propose BiHMP, an efficient Bidirectional High-order Message-Passing network to aggregate global-local information and high-order relationships on the constructed hypergraph while capturing the inter-column heterogeneity and intra-column homogeneity. To effectively and efficiently align the capacity of the LLM with the information aggregated by BiHMP, we introduce Xfusion, which, together with BiHMP, acts as adapters for the LLM. We follow a pre-training and fine-tuning pipeline to train UnIMP, integrating two optimizations: chunking technique, which divides tables into smaller chunks to enhance efficiency; and progressive masking technique, which gradually adapts the model to learn more complex data patterns. Both theoretical proofs and empirical experiments on 10 real world datasets highlight the superiority of UnIMP over existing techniques.
Abstract:Data generation is a fundamental research problem in data management due to its diverse use cases, ranging from testing database engines to data-specific applications. However, real-world entities often involve complex interactions that cannot be effectively modeled by traditional tabular data. Therefore, graph data generation has attracted increasing attention recently. Although various graph generators have been proposed in the literature, there are three limitations: i) They cannot capture the co-evolution pattern of graph structure and node attributes. ii) Few of them consider edge direction, leading to substantial information loss. iii) Current state-of-the-art dynamic graph generators are based on the temporal random walk, making the simulation process time-consuming. To fill the research gap, we introduce VRDAG, a novel variational recurrent framework for efficient dynamic attributed graph generation. Specifically, we design a bidirectional message-passing mechanism to encode both directed structural knowledge and attribute information of a snapshot. Then, the temporal dependency in the graph sequence is captured by a recurrence state updater, generating embeddings that can preserve the evolution pattern of early graphs. Based on the hidden node embeddings, a conditional variational Bayesian method is developed to sample latent random variables at the neighboring timestep for new snapshot generation. The proposed generation paradigm avoids the time-consuming path sampling and merging process in existing random walk-based methods, significantly reducing the synthesis time. Finally, comprehensive experiments on real-world datasets are conducted to demonstrate the effectiveness and efficiency of the proposed model.
Abstract:In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named contextual representation anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge to transfer enriched contextual knowledge into molecular representations and enhance their expressiveness. CRA introduces a dual-augmentation mechanism that includes context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state-of-the-art by 2.60% and 3.28% in AUC and $\Delta$AUC-PR metrics, respectively, and exhibits superior generalization capabilities.
Abstract:With growing demands for data privacy and model robustness, graph unlearning (GU), which erases the influence of specific data on trained GNN models, has gained significant attention. However, existing exact unlearning methods suffer from either low efficiency or poor model performance. While being more utility-preserving and efficient, current approximate unlearning methods are not applicable in the zero-glance privacy setting, where the deleted samples cannot be accessed during unlearning due to immediate deletion requested by regulations. Besides, these approximate methods, which try to directly perturb model parameters still involve high privacy concerns in practice. To fill the gap, we propose Transferable Condensation Graph Unlearning (TCGU), a data-centric solution to zero-glance graph unlearning. Specifically, we first design a two-level alignment strategy to pre-condense the original graph into a small yet utility-preserving dataset. Upon receiving an unlearning request, we fine-tune the pre-condensed data with a low-rank plugin, to directly align its distribution with the remaining graph, thus efficiently revoking the information of deleted data without accessing them. A novel similarity distribution matching approach and a discrimination regularizer are proposed to effectively transfer condensed data and preserve its utility in GNN training, respectively. Finally, we retrain the GNN on the transferred condensed data. Extensive experiments on 6 benchmark datasets demonstrate that TCGU can achieve superior performance in terms of model utility, unlearning efficiency, and unlearning efficacy than existing GU methods.
Abstract:Unsupervised Outlier Detection (UOD) is an important data mining task. With the advance of deep learning, deep Outlier Detection (OD) has received broad interest. Most deep UOD models are trained exclusively on clean datasets to learn the distribution of the normal data, which requires huge manual efforts to clean the real-world data if possible. Instead of relying on clean datasets, some approaches directly train and detect on unlabeled contaminated datasets, leading to the need for methods that are robust to such conditions. Ensemble methods emerged as a superior solution to enhance model robustness against contaminated training sets. However, the training time is greatly increased by the ensemble. In this study, we investigate the impact of outliers on the training phase, aiming to halt training on unlabeled contaminated datasets before performance degradation. Initially, we noted that blending normal and anomalous data causes AUC fluctuations, a label-dependent measure of detection accuracy. To circumvent the need for labels, we propose a zero-label entropy metric named Loss Entropy for loss distribution, enabling us to infer optimal stopping points for training without labels. Meanwhile, we theoretically demonstrate negative correlation between entropy metric and the label-based AUC. Based on this, we develop an automated early-stopping algorithm, EntropyStop, which halts training when loss entropy suggests the maximum model detection capability. We conduct extensive experiments on ADBench (including 47 real datasets), and the overall results indicate that AutoEncoder (AE) enhanced by our approach not only achieves better performance than ensemble AEs but also requires under 1\% of training time. Lastly, our proposed metric and early-stopping approach are evaluated on other deep OD models, exhibiting their broad potential applicability.
Abstract:Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.
Abstract:Being the most cutting-edge generative methods, diffusion methods have shown great advances in wide generation tasks. Among them, graph generation attracts significant research attention for its broad application in real life. In our survey, we systematically and comprehensively review on diffusion-based graph generative methods. We first make a review on three mainstream paradigms of diffusion methods, which are denoising diffusion probabilistic models, score-based genrative models, and stochastic differential equations. Then we further categorize and introduce the latest applications of diffusion models on graphs. In the end, we point out some limitations of current studies and future directions of future explorations. The summary of existing methods metioned in this survey is in https://github.com/zhejiangzhuque/Diffusion-based-Graph-Generative-Methods.
Abstract:Outlier detection (OD) has received continuous research interests due to its wide applications. With the development of deep learning, increasingly deep OD algorithms are proposed. Despite the availability of numerous deep OD models, existing research has reported that the performance of deep models is extremely sensitive to the configuration of hyperparameters (HPs). However, the selection of HPs for deep OD models remains a notoriously difficult task due to the lack of any labels and long list of HPs. In our study. we shed light on an essential factor, training time, that can introduce significant variation in the performance of deep model. Even the performance is stable across other HPs, training time itself can cause a serious HP sensitivity issue. Motivated by this finding, we are dedicated to formulating a strategy to terminate model training at the optimal iteration. Specifically, we propose a novel metric called loss entropy to internally evaluate the model performance during training while an automated training stopping algorithm is devised. To our knowledge, our approach is the first to enable reliable identification of the optimal training iteration during training without requiring any labels. Our experiments on tabular, image datasets show that our approach can be applied to diverse deep models and datasets. It not only enhances the robustness of deep models to their HPs, but also improves the performance and reduces plenty of training time compared to naive training.