Abstract:The maximum mean discrepancy and Wasserstein distance are popular distance measures between distributions and play important roles in many machine learning problems such as metric learning, generative modeling, domain adaption, and clustering. However, since they are functions of pair-wise distances between data points in two distributions, they do not exploit the potential manifold properties of data such as smoothness and hence are not effective in measuring the dissimilarity between the two distributions in the form of manifolds. In this paper, different from existing measures, we propose a novel distance called Mutual Regression Distance (MRD) induced by a constrained mutual regression problem, which can exploit the manifold property of data. We prove that MRD is a pseudometric that satisfies almost all the axioms of a metric. Since the optimization of the original MRD is costly, we provide a tight MRD and a simplified MRD, based on which a heuristic algorithm is established. We also provide kernel variants of MRDs that are more effective in handling nonlinear data. Our MRDs especially the simplified MRDs have much lower computational complexity than the Wasserstein distance. We provide theoretical guarantees, such as robustness, for MRDs. Finally, we apply MRDs to distribution clustering, generative models, and domain adaptation. The numerical results demonstrate the effectiveness and superiority of MRDs compared to the baselines.
Abstract:High-dimensional data visualization is crucial in the big data era and these techniques such as t-SNE and UMAP have been widely used in science and engineering. Big data, however, is often distributed across multiple data centers and subject to security and privacy concerns, which leads to difficulties for the standard algorithms of t-SNE and UMAP. To tackle the challenge, this work proposes Fed-tSNE and Fed-UMAP, which provide high-dimensional data visualization under the framework of federated learning, without exchanging data across clients or sending data to the central server. The main idea of Fed-tSNE and Fed-UMAP is implicitly learning the distribution information of data in a manner of federated learning and then estimating the global distance matrix for t-SNE and UMAP. To further enhance the protection of data privacy, we propose Fed-tSNE+ and Fed-UMAP+. We also extend our idea to federated spectral clustering, yielding algorithms of clustering distributed data. In addition to these new algorithms, we offer theoretical guarantees of optimization convergence, distance and similarity estimation, and differential privacy. Experiments on multiple datasets demonstrate that, compared to the original algorithms, the accuracy drops of our federated algorithms are tiny.
Abstract:Discrete distribution clustering (D2C) was often solved by Wasserstein barycenter methods. These methods are under a common assumption that clusters can be well represented by barycenters, which may not hold in many real applications. In this work, we propose a simple yet effective framework based on spectral clustering and distribution affinity measures (e.g., maximum mean discrepancy and Wasserstein distance) for D2C. To improve the scalability, we propose to use linear optimal transport to construct affinity matrices efficiently on large datasets. We provide theoretical guarantees for the success of the proposed methods in clustering distributions. Experiments on synthetic and real data show that our methods outperform the baselines largely in terms of both clustering accuracy and computational efficiency.