Abstract:Multivariate distributions often carry latent structures that are difficult to identify and estimate, and which better reflect the data generating mechanism than extrinsic structures exhibited simply by the raw data. In this paper, we propose a model-free approach for estimating such latent structures whenever they are present, without assuming they exist a priori. Given an arbitrary density $p_0$, we construct a multiscale representation of the density and propose data-driven methods for selecting representative models that capture meaningful discrete structure. Our approach uses a nonparametric maximum likelihood estimator to estimate the latent structure at different scales and we further characterize their asymptotic limits. By carrying out such a multiscale analysis, we obtain coarseto-fine structures inherent in the original distribution, which are integrated via a model selection procedure to yield an interpretable discrete representation of it. As an application, we design a clustering algorithm based on the proposed procedure and demonstrate its effectiveness in capturing a wide range of latent structures.
Abstract:Knowledge graphs (KGs) have been increasingly employed for link prediction and recommendation using real-world datasets. However, the majority of current methods rely on static data, neglecting the dynamic nature and the hidden spatio-temporal attributes of real-world scenarios. This often results in suboptimal predictions and recommendations. Although there are effective spatio-temporal inference methods, they face challenges such as scalability with large datasets and inadequate semantic understanding, which impede their performance. To address these limitations, this paper introduces a novel framework - Simple Spatio-Temporal Knowledge Graph (SSTKG), for constructing and exploring spatio-temporal KGs. To integrate spatial and temporal data into KGs, our framework exploited through a new 3-step embedding method. Output embeddings can be used for future temporal sequence prediction and spatial information recommendation, providing valuable insights for various applications such as retail sales forecasting and traffic volume prediction. Our framework offers a simple but comprehensive way to understand the underlying patterns and trends in dynamic KG, thereby enhancing the accuracy of predictions and the relevance of recommendations. This work paves the way for more effective utilization of spatio-temporal data in KGs, with potential impacts across a wide range of sectors.
Abstract:Gaussian process regression is a classical kernel method for function estimation and data interpolation. In large data applications, computational costs can be reduced using low-rank or sparse approximations of the kernel. This paper investigates the effect of such kernel approximations on the interpolation error. We introduce a unified framework to analyze Gaussian process regression under important classes of computational misspecification: Karhunen-Lo\`eve expansions that result in low-rank kernel approximations, multiscale wavelet expansions that induce sparsity in the covariance matrix, and finite element representations that induce sparsity in the precision matrix. Our theory also accounts for epistemic misspecification in the choice of kernel parameters.
Abstract:In this paper we propose an algorithm for aligning three-dimensional objects when represented as density maps, motivated by applications in cryogenic electron microscopy. The algorithm is based on minimizing the 1-Wasserstein distance between the density maps after a rigid transformation. The induced loss function enjoys a more benign landscape than its Euclidean counterpart and Bayesian optimization is employed for computation. Numerical experiments show improved accuracy and efficiency over existing algorithms on the alignment of real protein molecules. In the context of aligning heterogeneous pairs, we illustrate a potential need for new distance functions.
Abstract:In recent decades, science and engineering have been revolutionized by a momentous growth in the amount of available data. However, despite the unprecedented ease with which data are now collected and stored, labeling data by supplementing each feature with an informative tag remains to be challenging. Illustrative tasks where the labeling process requires expert knowledge or is tedious and time-consuming include labeling X-rays with a diagnosis, protein sequences with a protein type, texts by their topic, tweets by their sentiment, or videos by their genre. In these and numerous other examples, only a few features may be manually labeled due to cost and time constraints. How can we best propagate label information from a small number of expensive labeled features to a vast number of unlabeled ones? This is the question addressed by semi-supervised learning (SSL). This article overviews recent foundational developments on graph-based Bayesian SSL, a probabilistic framework for label propagation using similarities between features. SSL is an active research area and a thorough review of the extant literature is beyond the scope of this article. Our focus will be on topics drawn from our own research that illustrate the wide range of mathematical tools and ideas that underlie the rigorous study of the statistical accuracy and computational efficiency of graph-based Bayesian SSL.
Abstract:We study uniform consistency in nonparametric mixture models as well as closely related mixture of regression (also known as mixed regression) models, where the regression functions are allowed to be nonparametric and the error distributions are assumed to be convolutions of a Gaussian density. We construct uniformly consistent estimators under general conditions while simultaneously highlighting several pain points in extending existing pointwise consistency results to uniform results. The resulting analysis turns out to be nontrivial, and several novel technical tools are developed along the way. In the case of mixed regression, we prove $L^1$ convergence of the regression functions while allowing for the component regression functions to intersect arbitrarily often, which presents additional technical challenges. We also consider generalizations to general (i.e. non-convolutional) nonparametric mixtures.
Abstract:This paper studies Bayesian nonparametric estimation of a binary regression function in a semi-supervised setting. We assume that the features are supported on a hidden manifold, and use unlabeled data to construct a sequence of graph-based priors over the regression function restricted to the given features. We establish contraction rates for the corresponding graph-based posteriors, interpolated to be supported over regression functions on the underlying manifold. Minimax optimal contraction rates are achieved under certain conditions. Our results provide novel understanding on why and how unlabeled data are helpful in Bayesian semi-supervised classification.
Abstract:Several data analysis techniques employ similarity relationships between data points to uncover the intrinsic dimension and geometric structure of the underlying data-generating mechanism. In this paper we work under the model assumption that the data is made of random perturbations of feature vectors lying on a low-dimensional manifold. We study two questions: how to define the similarity relationship over noisy data points, and what is the resulting impact of the choice of similarity in the extraction of global geometric information from the underlying manifold. We provide concrete mathematical evidence that using a local regularization of the noisy data to define the similarity improves the approximation of the hidden Euclidean distance between unperturbed points. Furthermore, graph-based objects constructed with the locally regularized similarity function satisfy better error bounds in their recovery of global geometric ones. Our theory is supported by numerical experiments that demonstrate that the gain in geometric understanding facilitated by local regularization translates into a gain in classification accuracy in simulated and real data.