Abstract:Medical imaging is key in modern medicine. From magnetic resonance imaging (MRI) to microscopic imaging for blood cell detection, diagnostic medical imaging reveals vital insights into patient health. To predict diseases or provide individualized therapies, machine learning techniques like kernel methods have been widely used. Nevertheless, there are multiple challenges for implementing kernel methods. Medical image data often originates from various hospitals and cannot be combined due to privacy concerns, and the high dimensionality of image data presents another significant obstacle. While randomised encoding offers a promising direction, existing methods often struggle with a trade-off between accuracy and efficiency. Addressing the need for efficient privacy-preserving methods on distributed image data, we introduce OKRA (Orthonormal K-fRAmes), a novel randomized encoding-based approach for kernel-based machine learning. This technique, tailored for widely used kernel functions, significantly enhances scalability and speed compared to current state-of-the-art solutions. Through experiments conducted on various clinical image datasets, we evaluated model quality, computational performance, and resource overhead. Additionally, our method outperforms comparable approaches
Abstract:Machine learning on large-scale genomic or transcriptomic data is important for many novel health applications. For example, precision medicine tailors medical treatments to patients on the basis of individual biomarkers, cellular and molecular states, etc. However, the data required is sensitive, voluminous, heterogeneous, and typically distributed across locations where dedicated machine learning hardware is not available. Due to privacy and regulatory reasons, it is also problematic to aggregate all data at a trusted third party.Federated learning is a promising solution to this dilemma, because it enables decentralized, collaborative machine learning without exchanging raw data. In this paper, we perform comparative experiments with the federated learning frameworks TensorFlow Federated and Flower. Our test case is the training of disease prognosis and cell type classification models. We train the models with distributed transcriptomic data, considering both data heterogeneity and architectural heterogeneity. We measure model quality, robustness against privacy-enhancing noise, computational performance and resource overhead. Each of the federated learning frameworks has different strengths. However, our experiments confirm that both frameworks can readily build models on transcriptomic data, without transferring personal raw data to a third party with abundant computational resources.
Abstract:It is challenging to implement Kernel methods, if the data sources are distributed and cannot be joined at a trusted third party for privacy reasons. It is even more challenging, if the use case rules out privacy-preserving approaches that introduce noise. An example for such a use case is machine learning on clinical data. To realize exact privacy preserving computation of kernel methods, we propose FLAKE, a Federated Learning Approach for KErnel methods on horizontally distributed data. With FLAKE, the data sources mask their data so that a centralized instance can compute a Gram matrix without compromising privacy. The Gram matrix allows to calculate many kernel matrices, which can be used to train kernel-based machine learning algorithms such as Support Vector Machines. We prove that FLAKE prevents an adversary from learning the input data or the number of input features under a semi-honest threat model. Experiments on clinical and synthetic data confirm that FLAKE is outperforming the accuracy and efficiency of comparable methods. The time needed to mask the data and to compute the Gram matrix is several orders of magnitude less than the time a Support Vector Machine needs to be trained. Thus, FLAKE can be applied to many use cases.