Abstract:This paper studies computationally efficient methods and their minimax optimality for high-dimensional clustering and signal recovery under block signal structures. We propose two sets of methods, cross-block feature aggregation PCA (CFA-PCA) and moving average PCA (MA-PCA), designed for sparse and dense block signals, respectively. Both methods adaptively utilize block signal structures, applicable to non-Gaussian data with heterogeneous variances and non-diagonal covariance matrices. Specifically, the CFA method utilizes a block-wise U-statistic to aggregate and select block signals non-parametrically from data with unknown cluster labels. We show that the proposed methods are consistent for both clustering and signal recovery under mild conditions and weaker signal strengths than the existing methods without considering block structures of signals. Furthermore, we derive both statistical and computational minimax lower bounds (SMLB and CMLB) for high-dimensional clustering and signal recovery under block signals, where the CMLBs are restricted to algorithms with polynomial computation complexity. The minimax boundaries partition signals into regions of impossibility and possibility. No algorithm (or no polynomial time algorithm) can achieve consistent clustering or signal recovery if the signals fall into the statistical (or computational) region of impossibility. We show that the proposed CFA-PCA and MA-PCA methods can achieve the CMLBs for the sparse and dense block signal regimes, respectively, indicating the proposed methods are computationally minimax optimal. A tuning parameter selection method is proposed based on post-clustering signal recovery results. Simulation studies are conducted to evaluate the proposed methods. A case study on global temperature change demonstrates their utility in practice.
Abstract:Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
Abstract:Testing conditional independence has many applications, such as in Bayesian network learning and causal discovery. Different test methods have been proposed. However, existing methods generally can not work when only discretized observations are available. Specifically, consider $X_1$, $\tilde{X}_2$ and $X_3$ are observed variables, where $\tilde{X}_2$ is a discretization of latent variables $X_2$. Applying existing test methods to the observations of $X_1$, $\tilde{X}_2$ and $X_3$ can lead to a false conclusion about the underlying conditional independence of variables $X_1$, $X_2$ and $X_3$. Motivated by this, we propose a conditional independence test specifically designed to accommodate the presence of such discretization. To achieve this, we design the bridge equations to recover the parameter reflecting the statistical information of the underlying latent continuous variables. An appropriate test statistic and its asymptotic distribution under the null hypothesis of conditional independence have also been derived. Both theoretical results and empirical validation have been provided, demonstrating the effectiveness of our test methods.