Abstract:Graphs are ubiquitous in real-world scenarios and encompass a diverse range of tasks, from node-, edge-, and graph-level tasks to transfer learning. However, designing specific tasks for each type of graph data is often costly and lacks generalizability. Recent endeavors under the "Pre-training + Fine-tuning" or "Pre-training + Prompt" paradigms aim to design a unified framework capable of generalizing across multiple graph tasks. Among these, graph autoencoders (GAEs), generative self-supervised models, have demonstrated their potential in effectively addressing various graph tasks. Nevertheless, these methods typically employ multi-stage training and require adaptive designs, which on one hand make it difficult to be seamlessly applied to diverse graph tasks and on the other hand overlook the negative impact caused by discrepancies in task objectives between the different stages. To address these challenges, we propose GA^2E, a unified adversarially masked autoencoder capable of addressing the above challenges seamlessly. Specifically, GA^2E proposes to use the subgraph as the meta-structure, which remains consistent across all graph tasks (ranging from node-, edge-, and graph-level to transfer learning) and all stages (both during training and inference). Further, GA^2E operates in a \textbf{"Generate then Discriminate"} manner. It leverages the masked GAE to reconstruct the input subgraph whilst treating it as a generator to compel the reconstructed graphs resemble the input subgraph. Furthermore, GA^2E introduces an auxiliary discriminator to discern the authenticity between the reconstructed (generated) subgraph and the input subgraph, thus ensuring the robustness of the graph representation through adversarial training mechanisms. We validate GA^2E's capabilities through extensive experiments on 21 datasets across four types of graph tasks.
Abstract:Class imbalance in graph data poses significant challenges for node classification. Existing methods, represented by SMOTE-based approaches, partially alleviate this issue but still exhibit limitations during imbalanced scenario construction. Self-supervised learning (SSL) offers a promising solution by synthesizing minority nodes from the data itself, yet its potential remains unexplored. In this paper, we analyze the limitations of SMOTE-based approaches and introduce VIGraph, a novel SSL model based on the self-supervised Variational Graph Auto-Encoder (VGAE) that leverages Variational Inference (VI) to generate minority nodes. Specifically, VIGraph strictly adheres to the concept of imbalance when constructing imbalanced graphs and utilizes the generative VGAE to generate minority nodes. Moreover, VIGraph introduces a novel Siamese contrastive strategy at the decoding phase to improve the overall quality of generated nodes. VIGraph can generate high-quality nodes without reintegrating them into the original graph, eliminating the "Generating, Reintegrating, and Retraining" process found in SMOTE-based methods. Experiments on multiple real-world datasets demonstrate that VIGraph achieves promising results for class-imbalanced node classification tasks.
Abstract:In recent years, contrastive learning has emerged as a dominant self-supervised paradigm, attracting numerous research interests in the field of graph learning. Graph contrastive learning (GCL) aims to embed augmented anchor samples close to each other while pushing the embeddings of other samples (negative samples) apart. However, existing GCL methods require large and diverse negative samples to ensure the quality of embeddings, and recent studies typically leverage samples excluding the anchor and positive samples as negative samples, potentially introducing false negative samples (negatives that share the same class as the anchor). Additionally, this practice can result in heavy computational burden and high time complexity of $O(N^2)$, which is particularly unaffordable for large graphs. To address these deficiencies, we leverage rank learning and propose a simple yet effective model, GraphRank. Specifically, we first generate two graph views through corruption. Then, we compute the similarity of pairwise nodes (anchor node and positive node) in both views, an arbitrary node in the latter view is selected as a negative node, and its similarity with the anchor node is computed. Based on this, we introduce rank-based learning to measure similarity scores which successfully relieve the false negative provlem and decreases the time complexity from $O(N^2)$ to $O(N)$. Moreover, we conducted extensive experiments across multiple graph tasks, demonstrating that GraphRank performs favorably against other cutting-edge GCL methods in various tasks.
Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable abilities in handling a variety of natural language processing (NLP) downstream tasks, even on mathematical tasks requiring multi-step reasoning. In this report, we introduce the KwaiYiiMath which enhances the mathematical reasoning abilities of KwaiYiiBase1, by applying Supervised Fine-Tuning (SFT) and Reinforced Learning from Human Feedback (RLHF), including on both English and Chinese mathematical tasks. Meanwhile, we also constructed a small-scale Chinese primary school mathematics test set (named KMath), consisting of 188 examples to evaluate the correctness of the problem-solving process generated by the models. Empirical studies demonstrate that KwaiYiiMath can achieve state-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with the similar size models, respectively.
Abstract:Generative self-supervised learning (SSL) has exhibited significant potential and garnered increasing interest in graph learning. In this study, we aim to explore the problem of generative SSL in the context of heterogeneous graph learning (HGL). The previous SSL approaches for heterogeneous graphs have primarily relied on contrastive learning, necessitating the design of complex views to capture heterogeneity. However, existing generative SSL methods have not fully leveraged the capabilities of generative models to address the challenges of HGL. In this paper, we present HGCVAE, a novel contrastive variational graph auto-encoder that liberates HGL from the burden of intricate heterogeneity capturing. Instead of focusing on complicated heterogeneity, HGCVAE harnesses the full potential of generative SSL. HGCVAE innovatively consolidates contrastive learning with generative SSL, introducing several key innovations. Firstly, we employ a progressive mechanism to generate high-quality hard negative samples for contrastive learning, utilizing the power of variational inference. Additionally, we present a dynamic mask strategy to ensure effective and stable learning. Moreover, we propose an enhanced scaled cosine error as the criterion for better attribute reconstruction. As an initial step in combining generative and contrastive SSL, HGCVAE achieves remarkable results compared to various state-of-the-art baselines, confirming its superiority.
Abstract:Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule graph can be represented either as a graph-structured data or a SMILES text. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to utilize LLMs to assist in understanding molecules represented by SMILES, the exploration of how LLMs will impact molecular property prediction is still in its early stage. In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules. To be specific, we first prompt LLMs to do in-context molecular classification and evaluate their performance. After that, we employ LLMs to generate semantically enriched explanations for the original SMILES and then leverage that to fine-tune a small-scale LM model for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets, and confirm the immense potential of LLMs in molecular property prediction tasks. Codes are available at \url{https://github.com/ChnQ/LLM4Mol}.