Abstract: Large Language Models (LLMs) frequently lack domain-specific knowledge, and even fine-tuned models tend to hallucinate. Hence, more reliable models that can incorporate external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge, making use of the models' ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers through triplet-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of four steps: 1) triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates using a knowledge graph, 3) sorting the remaining candidates with a vector similarity search involving associated unstructured data, and 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is a powerful augmentation. It not only adds relevant, traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore opens up a wide range of opportunities for future work. The source code is available at https://github.com/kramerlab/4StepFocus.
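The four steps listed in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge graph, the documents, the function names, and the bag-of-words similarity are all assumptions made for illustration, and the two LLM steps (1 and 4) are replaced by hard-coded stubs.

```python
import re
from collections import Counter
from math import sqrt

# Toy knowledge graph of (head, relation, tail) triples; all data is made up.
KG = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("ibuprofen", "treats", "inflammation"),
    ("paracetamol", "treats", "fever"),
}

# Unstructured text associated with each entity (also made up).
DOCS = {
    "aspirin": "blood thinner also used for pain relief",
    "ibuprofen": "anti inflammatory drug for pain and swelling",
    "paracetamol": "reduces fever and mild pain",
}


def generate_triplets(question):
    """Step 1 (stub): a real system would prompt an LLM to extract triplets."""
    return [("?x", "treats", "headache")]


def substitute_variables(triplets, kg):
    """Step 2: keep entities that satisfy every triplet.

    This sketch only handles patterns of the form ("?x", relation, tail).
    """
    candidates = None
    for _, rel, tail in triplets:
        matches = {h for (h, r, t) in kg if r == rel and t == tail}
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()


def bow(text):
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter yields 0 for missing tokens
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_similarity(question, candidates, docs):
    """Step 3: sort candidates by similarity between question and entity text."""
    q = bow(question)
    return sorted(candidates, key=lambda c: (-cosine(q, bow(docs[c])), c))


def rerank(question, ranked, docs, top_k=2):
    """Step 4 (stub): a real system would let the LLM reorder the top-k
    candidates with their documents as context; here the order is kept."""
    return ranked[:top_k]


def answer_question(question):
    triplets = generate_triplets(question)
    candidates = substitute_variables(triplets, KG)
    ranked = rank_by_similarity(question, candidates, DOCS)
    return rerank(question, ranked, DOCS)


if __name__ == "__main__":
    print(answer_question("Which anti-inflammatory drug treats headache and swelling?"))
```

In this toy setup the graph search first restricts the answer set to entities that satisfy the extracted triplet, and only then does the similarity ranking order the survivors, which mirrors the "narrow down, then rank" idea described above.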
Abstract: Deep learning (DL) approaches are achieving extraordinary results in a wide range of domains but often require a massive collection of private data. Hence, methods are called for that train neural networks on the joint data of different data owners while keeping each party's input confidential. We address the setting of horizontally distributed data in deep learning, where the participants' vulnerable intermediate results have to be processed in a privacy-preserving manner. The predominant scheme for this setting is based on homomorphic encryption (HE), and it is widely considered to be without alternative. In contrast, we demonstrate that a carefully chosen, less complex, and computationally less expensive secure sum protocol, used in conjunction with default secure channels, exhibits superior properties in terms of both collusion resistance and runtime. Finally, we discuss several open research questions in the context of collaborative DL, which might lead back to HE-based solutions.
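To make the notion of a secure sum concrete, the following is a minimal sketch of one well-known variant based on additive secret sharing. The abstract does not say which secure sum protocol the authors chose, so this is an assumption for illustration only; the modulus, the function names, and the single-process simulation of the parties are likewise hypothetical. In a real deployment each share would travel over a pairwise secure channel.

```python
import secrets

# Assumed modulus; inputs (e.g. quantized intermediate results) must be
# non-negative integers whose true sum stays below it.
MODULUS = 2**61 - 1


def share(value, n_parties, modulus=MODULUS):
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_values, modulus=MODULUS):
    """Simulate all parties in one process.

    all_shares[i][j] is the share that party i would send to party j.
    Each party j only ever sees one random-looking share per peer.
    """
    n = len(private_values)
    all_shares = [share(v % modulus, n, modulus) for v in private_values]
    # Party j adds up the shares it received and publishes the partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums reveal only the total.
    return sum(partial_sums) % modulus


if __name__ == "__main__":
    print(secure_sum([3, 5, 7]))
```

In this variant an individual input stays hidden unless all other parties pool the shares they received, and the only operations are random number generation and modular addition, which is why such protocols tend to be far cheaper than homomorphic encryption.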