Abstract:Optimal transport (OT) and the related Wasserstein metric (W) are powerful and ubiquitous tools for comparing distributions. However, computing pairwise Wasserstein distances rapidly becomes intractable as cohort size grows. An attractive alternative would be to find an embedding space in which pairwise Euclidean distances map to OT distances, akin to standard multidimensional scaling (MDS). We present Wasserstein Wormhole, a transformer-based autoencoder that embeds empirical distributions into a latent space wherein Euclidean distances approximate OT distances. Extending MDS theory, we show that our objective function implies a bound on the error incurred when embedding non-Euclidean distances. Empirically, distances between Wormhole embeddings closely match Wasserstein distances, enabling linear time computation of OT distances. Along with an encoder that maps distributions to embeddings, Wasserstein Wormhole includes a decoder that maps embeddings back to distributions, allowing for operations in the embedding space to generalize to OT spaces, such as Wasserstein barycenter estimation and OT interpolation. By lending scalability and interpretability to OT approaches, Wasserstein Wormhole unlocks new avenues for data analysis in the fields of computational geometry and single-cell biology.
Abstract:We present a Bayesian hierarchical multi-view mixture model termed Symphony that simultaneously learns clusters of cells representing cell types and their underlying gene regulatory networks by integrating data from two views: single-cell gene expression data and paired epigenetic data, which is informative of gene-gene interactions. This model improves interpretation of clusters as cell types with similar expression patterns as well as regulatory networks driving expression, by explaining gene-gene covariances with the biological machinery regulating gene expression. We show the theoretical advantages of the multi-view learning approach and present a Variational EM inference procedure. We demonstrate superior performance on both synthetic data and real genomic data with subtypes of peripheral blood cells compared to other methods.