Abstract: In this study, we present Flatsomatic, a Variational Autoencoder (VAE) optimized to compress somatic mutation profiles, allowing for unbiased data compression whilst maintaining the signal. We compared two neural network architectures for the VAE: a Multilayer Perceptron (MLP) and a bidirectional LSTM. The somatic profiles used to train our models came from 8,062 Pan-Cancer patients from The Cancer Genome Atlas and 989 cell lines from the COSMIC cell line project. Each profile was represented by the genomic loci where somatic mutations occurred and, to reduce sparsity, loci with a frequency <5 were removed. We enhanced the VAE performance by changing its evidence lower bound, devising an F1-score based loss and showing that it helps the VAE learn better than binary cross-entropy does. We also employed beta-VAE to weight the variational regularisation term in the loss function, and obtained the best performance with a preliminary function that increases the weight of the regularisation term with each epoch. We assessed the reconstruction ability of the VAE using the micro F1-score and found that our best performing model was a 2-layer deep MLP VAE. Our analysis also showed that the size of the latent space did not have a significant effect on the VAE's learning ability. We compared the Flatsomatic embeddings to a lower-dimensional version of the data obtained by principal component analysis (PCA), which showed superior performance of Flatsomatic, and performed K-means clustering on both representations to compare the clusters with the known cancer type of each profile. Finally, we present results confirming, through prediction of drug response, that the 64-dimensional Flatsomatic representations maintain the same predictive power as the original 8,298-dimensional vectors.
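To make the loss described above more concrete, the following is a minimal sketch of an F1-score based reconstruction term combined with a beta-weighted KL term whose weight grows with each epoch. It is an illustration under stated assumptions, not the Flatsomatic implementation: the abstract does not specify the exact F1 formulation or the shape of the weighting function, so the "soft" micro F1 (using predicted probabilities in place of hard counts) and the linear warm-up used here are assumptions, and all function names (`soft_f1_loss`, `beta_schedule`, `vae_loss`) are hypothetical. It uses only the Python standard library; in a real training loop the same arithmetic would be done on tensors so that it is differentiable.

```python
import math


def soft_f1_loss(y_true, y_prob, eps=1e-8):
    """Return 1 - soft micro F1 over every entry of a batch.

    y_true: rows of 0/1 mutation indicators.
    y_prob: rows of reconstructed probabilities in [0, 1].
    Assumption: probabilities stand in for hard predictions so that
    the counts (and therefore the loss) vary smoothly.
    """
    tp = fp = fn = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            tp += t * p              # soft true positives
            fp += (1 - t) * p        # soft false positives
            fn += t * (1 - p)        # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def beta_schedule(epoch, n_epochs, beta_max=1.0):
    """Hypothetical linear warm-up: beta rises from 0 to beta_max."""
    if n_epochs <= 1:
        return beta_max
    return beta_max * min(1.0, epoch / (n_epochs - 1))


def vae_loss(y_true, y_prob, mu, log_var, beta):
    """Soft-F1 reconstruction loss plus beta times the mean KL per sample."""
    kl = sum(kl_to_standard_normal(m, lv)
             for m, lv in zip(mu, log_var)) / len(mu)
    return soft_f1_loss(y_true, y_prob) + beta * kl
```

With this sketch, `beta_schedule(epoch, n_epochs)` would be evaluated once per epoch and passed to `vae_loss` as `beta`, so the regularisation term carries little weight early in training and its full weight by the final epoch. Unlike binary cross-entropy, the soft F1 term gives no credit for correctly predicting the many zero entries, which is one plausible reason such a loss suits sparse mutation profiles.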
Abstract: Analysis of somatic mutation profiles from cancer patients is essential to the advancement of cancer research. However, the low frequency of most mutations and the varying mutation rates across patients make the data extremely challenging to analyze statistically, and difficult to use for classification, clustering, visualization, or learning useful information. Low-dimensional representations of somatic mutation profiles that retain useful information about the DNA of cancer cells would therefore facilitate the use of such data in applications that advance precision medicine. In this paper, we discuss the open problem of learning from somatic mutations and present Flatsomatic: a solution that uses variational autoencoders (VAEs) to create latent representations of somatic profiles. The work shows great potential for this method, with the VAE embeddings performing better than PCA on a clustering task and as well as the raw high-dimensional data on a classification task. We believe the methods presented herein can be of great value in future research and in bringing data-driven models into precision oncology.