Abstract:In the rapidly evolving field of metabolic engineering, the quest for efficient and precise gene target identification for metabolite production enhancement presents significant challenges. Traditional approaches, whether knowledge-based or model-based, are notably time-consuming and labor-intensive, due to the vast scale of research literature and the approximation nature of genome-scale metabolic model (GEM) simulations. Therefore, we propose a new task, Gene-Metabolite Association Prediction based on metabolic graphs, to automate the process of candidate gene discovery for a given pair of metabolite and candidate-associated genes, as well as presenting the first benchmark containing 2474 metabolites and 1947 genes of two commonly used microorganisms Saccharomyces cerevisiae (SC) and Issatchenkia orientalis (IO). This task is challenging due to the incompleteness of the metabolic graphs and the heterogeneity among distinct metabolisms. To overcome these limitations, we propose an Interactive Knowledge Transfer mechanism based on Metabolism Graph (IKT4Meta), which improves the association prediction accuracy by integrating the knowledge from different metabolism graphs. First, to build a bridge between two graphs for knowledge transfer, we utilize Pretrained Language Models (PLMs) with external knowledge of genes and metabolites to help generate inter-graph links, significantly alleviating the impact of heterogeneity. Second, we propagate intra-graph links from different metabolic graphs using inter-graph links as anchors. Finally, we conduct the gene-metabolite association prediction based on the enriched metabolism graphs, which integrate the knowledge from multiple microorganisms. Experiments on both types of organisms demonstrate that our proposed methodology outperforms baselines by up to 12.3% across various link prediction frameworks.
Abstract:Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height, and structure, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Existing large-scale building datasets lack accuracy, extensibility and indicative attributes. This paper presents a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling, introducing the first Multi-Attribute Building dataset (CMAB) in China at a national scale. The dataset covers 3,667 natural cities with a total rooftop area of 21.3 billion square meters with an F1-Score of 89.93% in rooftop extraction through the OCRNet. We trained bootstrap aggregated XGBoost models with city administrative classifications, incorporating building features such as morphology, location, and function. Using multi-source data, including billions of high-resolution Google Earth imagery and 60 million street view images (SVI), we generated rooftop, height, function, age, and quality attributes for each building. Accuracy was validated through model benchmarks, existing similar products, and manual SVI validation. The results support urban planning and sustainable development.
Abstract:Fine-grained few-shot entity extraction in the chemical domain faces two unique challenges. First, compared with entity extraction tasks in the general domain, sentences from chemical papers usually contain more entities. Moreover, entity extraction models usually have difficulty extracting entities of long-tailed types. In this paper, we propose Chem-FINESE, a novel sequence-to-sequence (seq2seq) based few-shot entity extraction approach, to address these two challenges. Our Chem-FINESE has two components: a seq2seq entity extractor to extract named entities from the input sentence and a seq2seq self-validation module to reconstruct the original input sentence from extracted entities. Inspired by the fact that a good entity extraction system needs to extract entities faithfully, our new self-validation module leverages entity extraction results to reconstruct the original input sentence. Besides, we design a new contrastive loss to reduce excessive copying during the extraction process. Finally, we release ChemNER+, a new fine-grained chemical entity extraction dataset that is annotated by domain experts with the ChemNER schema. Experiments in few-shot settings with both ChemNER+ and CHEMET datasets show that our newly proposed framework has contributed up to 8.26% and 6.84% absolute F1-score gains respectively.
Abstract:With higher-order neighborhood information of graph network, the accuracy of graph representation learning classification can be significantly improved. However, the current higher order graph convolutional network has a large number of parameters and high computational complexity. Therefore, we propose a Hybrid Lower order and Higher order Graph convolutional networks (HLHG) learning model, which uses weight sharing mechanism to reduce the number of network parameters. To reduce computational complexity, we propose a novel fusion pooling layer to combine the neighborhood information of high order and low order. Theoretically, we compare the model complexity of the proposed model with the other state-of-the-art model. Experimentally, we verify the proposed model on the large-scale text network datasets by supervised learning, and on the citation network datasets by semi-supervised learning. The experimental results show that the proposed model achieves highest classification accuracy with a small set of trainable weight parameters.