Abstract:The recent success of Large Language Models (LLM) in a wide range of Natural Language Processing applications opens the path towards novel Question Answering Systems over Knowledge Graphs leveraging LLMs. However, one of the main obstacles preventing their implementation is the scarcity of training data for the task of translating questions into corresponding SPARQL queries, particularly in the case of domain-specific KGs. To overcome this challenge, in this study, we evaluate several strategies for fine-tuning the OpenLlama LLM for question answering over life science knowledge graphs. In particular, we propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph towards a larger dataset of semantically enriched question-to-SPARQL query pairs, enabling fine-tuning even for datasets where these pairs are scarce. In this context, we also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments. Finally, we evaluate our approach over the real-world Bgee gene expression knowledge graph and we show that semantic clues can improve model performance by up to 33% compared to a baseline with random variable names and no comments included.
Abstract:We introduce the RIKEN Microstructural Imaging Metadatabase, a semantic web-based imaging database in which image metadata are described using the Resource Description Framework (RDF) and detailed biological properties observed in the images can be represented as Linked Open Data. The metadata are used to develop a large-scale imaging viewer that provides a straightforward graphical user interface to visualise a large microstructural tiling image at the gigabyte level. We applied the database to accumulate comprehensive microstructural imaging data produced by automated scanning electron microscopy. As a result, we have successfully managed vast numbers of images and their metadata, including the interpretation of morphological phenotypes occurring in sub-cellular components and biosamples captured in the images. We also discuss advanced utilisation of morphological imaging data that can be promoted by this database.
Abstract:Imaging data is one of the most important fundamentals in the current life sciences. We aimed to construct an ontology to describe imaging metadata as a data schema of the integrated database for optical and electron microscopy images combined with various bio-entities. To realise this, we applied Resource Description Framework (RDF) to an Open Microscopy Environment (OME) data model, which is the de facto standard to describe optical microscopy images and experimental data. We translated the XML-based OME metadata into the base concept of RDF schema as a trial of developing microscopy ontology. In this ontology, we propose 18 upper-level concepts including missing concepts in OME such as electron microscopy, phenotype data, biosample, and imaging conditions.