Abstract:Confirmation bias, the tendency to interpret information in a way that aligns with one's preconceptions, can profoundly impact scientific research, leading to conclusions that reflect the researcher's hypotheses even when the observational data do not support them. This issue is especially critical in scientific fields involving highly noisy observations, such as cryo-electron microscopy. This study investigates confirmation bias in Gaussian mixture models. We consider the following experiment: A team of scientists assumes they are analyzing data drawn from a Gaussian mixture model with known signals (hypotheses) as centroids. However, in reality, the observations consist entirely of noise without any informative structure. The researchers use a single iteration of the K-means or expectation-maximization algorithms, two popular algorithms to estimate the centroids. Despite the observations being pure noise, we show that these algorithms yield biased estimates that resemble the initial hypotheses, contradicting the unbiased expectation that averaging these noise observations would converge to zero. Namely, the algorithms generate estimates that mirror the postulated model, although the hypotheses (the presumed centroids of the Gaussian mixture) are not evident in the observations. Specifically, among other results, we prove a positive correlation between the estimates produced by the algorithms and the corresponding hypotheses. We also derive explicit closed-form expressions of the estimates for a finite and infinite number of hypotheses. This study underscores the risks of confirmation bias in low signal-to-noise environments, provides insights into potential pitfalls in scientific methodologies, and highlights the importance of prudent data interpretation.
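The noise-only experiment described in this abstract can be imitated with a minimal sketch. It assumes three hypothetical, mutually orthogonal centroids of equal norm in R^4 (an illustrative choice, not taken from the paper), draws pure Gaussian noise, and runs a single K-means iteration: each observation is assigned to its nearest hypothesis, and each cluster is then averaged. All function names are illustrative.

```python
import random


def kmeans_single_iteration(observations, hypotheses):
    """One K-means step: assign each point to its nearest hypothesis, then average."""
    dim = len(hypotheses[0])
    sums = [[0.0] * dim for _ in hypotheses]
    counts = [0] * len(hypotheses)
    for x in observations:
        # Index of the hypothesis with the smallest squared Euclidean distance to x.
        nearest = min(
            range(len(hypotheses)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, hypotheses[k])),
        )
        counts[nearest] += 1
        sums[nearest] = [s + a for s, a in zip(sums[nearest], x)]
    # An empty cluster keeps its hypothesis unchanged.
    return [
        [s / c for s in row] if c else list(h)
        for row, c, h in zip(sums, counts, hypotheses)
    ]


def noise_only_demo(seed=0, n=3000):
    """Return <estimate_k, hypothesis_k> for each k when the data are pure noise."""
    rng = random.Random(seed)
    # Hypothetical centroids (assumption for illustration): 2*e_0, 2*e_1, 2*e_2 in R^4.
    hypotheses = [[2.0 if i == k else 0.0 for i in range(4)] for k in range(3)]
    noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)]
    estimates = kmeans_single_iteration(noise, hypotheses)
    return [sum(a * b for a, b in zip(e, h)) for e, h in zip(estimates, hypotheses)]


if __name__ == "__main__":
    print(noise_only_demo())
```

Under these assumptions, every inner product should come out clearly positive even though the data contain no signal: each cluster collects exactly those noise vectors that happen to point toward its hypothesis, so the cluster mean inherits that direction.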
Abstract:"Einstein from noise" (EfN) is a prominent example of the model bias phenomenon: systematic errors in the statistical model that lead to erroneous but consistent estimates. In the EfN experiment, one falsely believes that a set of observations contains noisy, shifted copies of a template signal (e.g., an Einstein image), whereas in reality, it contains only pure noise observations. To estimate the signal, the observations are first aligned with the template using cross-correlation, and then averaged. Although the observations contain nothing but noise, it was recognized early on that this process produces a signal that resembles the template signal! This pitfall was at the heart of a central scientific controversy about validation techniques in structural biology. This paper provides a comprehensive statistical analysis of the EfN phenomenon above. We show that the Fourier phases of the EfN estimator (namely, the average of the aligned noise observations) converge to the Fourier phases of the template signal, explaining the observed structural similarity. Additionally, we prove that the convergence rate is inversely proportional to the number of noise observations and, in the high-dimensional regime, to the Fourier magnitudes of the template signal. Moreover, in the high-dimensional regime, the Fourier magnitudes converge to a scaled version of the template signal's Fourier magnitudes. This work not only deepens the theoretical understanding of the EfN phenomenon but also highlights potential pitfalls in template matching techniques and emphasizes the need for careful interpretation of noisy observations across disciplines in engineering, statistics, physics, and biology.
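The align-then-average procedure described in this abstract can be sketched in one dimension with circular shifts. The template below is a hypothetical zero-mean signal of length 16 (an assumption for illustration, not an Einstein image), and the function names are illustrative. The sketch compares the projection onto the template of the aligned noise average with that of the plain, unaligned noise average.

```python
import math
import random


def circular_shift(v, s):
    """Shift the list v circularly by s positions to the right."""
    s %= len(v)
    return v[-s:] + v[:-s] if s else list(v)


def inner(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def align_and_average(template, observations):
    """Align each observation to the template by cross-correlation, then average."""
    d = len(template)
    total = [0.0] * d
    for obs in observations:
        # Choose the circular shift that maximizes the correlation with the template.
        best = max(range(d), key=lambda s: inner(circular_shift(obs, s), template))
        total = [t + a for t, a in zip(total, circular_shift(obs, best))]
    return [t / len(observations) for t in total]


def efn_demo(seed=0, n=2000):
    """Project the aligned and the plain noise averages onto the unit-norm template."""
    rng = random.Random(seed)
    # Hypothetical zero-mean template (assumption for illustration).
    template = [1.0, 2.0, -2.0, -1.0] + [0.0] * 12
    noise = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(n)]
    norm = math.sqrt(inner(template, template))
    aligned = align_and_average(template, noise)
    plain = [sum(col) / n for col in zip(*noise)]
    return inner(aligned, template) / norm, inner(plain, template) / norm


if __name__ == "__main__":
    print(efn_demo())
```

In this setup the plain average should have a projection close to zero, while the aligned average should have a clearly positive one, because the alignment step selects, for every noise vector, the shift that correlates best with the template.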
Abstract:Semi-algebraic priors are ubiquitous in signal processing and machine learning. Prevalent examples include a) linear models where the signal lies in a low-dimensional subspace; b) sparse models where the signal can be represented by only a few coefficients under a suitable basis; and c) a large family of neural network generative models. In this paper, we prove a transversality theorem for semi-algebraic sets in orthogonal or unitary representations of groups: with a suitable dimension bound, a generic translate of any semi-algebraic set is transverse to the orbits of the group action. This, in turn, implies that if a signal lies in a low-dimensional semi-algebraic set, then it can be recovered uniquely from measurements that separate orbits. As an application, we consider the implications of the transversality theorem for the problem of recovering signals that are translated by random group actions from their second moment. As a special case, we discuss cryo-EM: a leading technology to determine the spatial structure of biological molecules, which serves as our prime motivation. In particular, we derive explicit bounds for recovering a molecular structure from the second moment under a semi-algebraic prior and deduce information-theoretic implications. We also obtain information-theoretic bounds for three additional applications: factoring Gram matrices, multi-reference alignment, and phase retrieval. Finally, we deduce bounds for designing permutation invariant separators in machine learning.
Abstract:The classical beltway problem entails recovering a set of points from their unordered pairwise distances on the circle. This problem can be viewed as a special case of the crystallographic phase retrieval problem of recovering a sparse signal from its periodic autocorrelation. Based on this interpretation, and motivated by cryo-electron microscopy, we suggest a natural generalization to orthogonal groups: recovering a sparse signal, up to an orthogonal transformation, from its autocorrelation over the orthogonal group. If the support of the signal is collision-free, we bound the number of solutions to the beltway problem over orthogonal groups, and prove that this bound is exactly one when the support of the signal is radially collision-free (i.e., the support points have distinct magnitudes). We also prove that if the pairwise products of the signal's weights are distinct, then the autocorrelation determines the signal uniquely, up to an orthogonal transformation. We conclude the paper by considering binary signals and show that in this case, the collision-free condition need not be sufficient to determine signals up to orthogonal transformation.
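The link between pairwise distances on the circle and the periodic autocorrelation, as described in this abstract, can be illustrated on a discretized circle. The sketch below assumes points on the cyclic grid {0, ..., n-1} and a binary indicator signal (both simplifications for illustration); the function names are hypothetical. The autocorrelation at lag k counts the ordered pairs of support points that lie k steps apart, and it does not change when the support is rotated or reflected.

```python
def indicator(support, n):
    """Binary signal of length n that equals 1 exactly on the support."""
    return [1 if i in support else 0 for i in range(n)]


def periodic_autocorrelation(x):
    """a[k] = sum_i x[i] * x[(i + k) mod n]."""
    n = len(x)
    return [sum(x[i] * x[(i + k) % n] for i in range(n)) for k in range(n)]


def rotate(support, shift, n):
    """Rotate every support point by `shift` steps around the cyclic grid."""
    return {(p + shift) % n for p in support}


def reflect(support, n):
    """Reflect every support point through the origin of the cyclic grid."""
    return {(-p) % n for p in support}


if __name__ == "__main__":
    n = 7
    support = {0, 1, 3}
    base = periodic_autocorrelation(indicator(support, n))
    print(base)
    # Rotations and reflections leave the autocorrelation unchanged.
    print(periodic_autocorrelation(indicator(rotate(support, 2, n), n)) == base)
    print(periodic_autocorrelation(indicator(reflect(support, n), n)) == base)
```

Because of this invariance, the autocorrelation can determine the support only up to such symmetries, which is the cyclic analogue of the recovery "up to an orthogonal transformation" discussed in the abstract.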
Abstract:This paper studies the classical problem of detecting the location of multiple image occurrences in a two-dimensional, noisy measurement. Assuming the image occurrences do not overlap, we formulate this task as a constrained maximum likelihood optimization problem. We show that the maximum likelihood estimator is equivalent to an instance of the winner determination problem from the field of combinatorial auction, and that the solution can be obtained by searching over a binary tree. We then design a pruning mechanism that significantly accelerates the runtime of the search. We demonstrate on simulations and electron microscopy data sets that the proposed algorithm provides accurate detection in challenging regimes of high noise levels and densely packed image occurrences.
Abstract:Multi-target detection (MTD) is the problem of estimating an image from a large, noisy measurement that contains randomly translated and rotated copies of the image. Motivated by the single-particle cryo-electron microscopy technology, we design data-driven diffusion priors for the MTD problem, derived from score-based stochastic differential equations models. We then integrate the prior into the approximate expectation-maximization algorithm. In particular, our method alternates between an expectation step that approximates the expected log-likelihood and a maximization step that balances the approximated log-likelihood with the learned log-prior. We show on two datasets that adding the data-driven prior substantially reduces the estimation error, in particular in high noise regimes.
Abstract:The key ingredient to retrieving a signal from its Fourier magnitudes, namely, to solve the phase retrieval problem, is an effective prior on the sought signal. In this paper, we study the phase retrieval problem under the prior that the signal lies in a semi-algebraic set. This is a very general prior as semi-algebraic sets include linear models, sparse models, and ReLU neural network generative models. The latter is the main motivation of this paper, due to the remarkable success of deep generative models in a variety of imaging tasks, including phase retrieval. We prove that almost all signals in R^N can be determined from their Fourier magnitudes, up to a sign, if they lie in a (generic) semi-algebraic set of dimension N/2. The same is true for all signals if the semi-algebraic set is of dimension N/4. We also generalize these results to the problem of signal recovery from the second moment in multi-reference alignment models with multiplicity free representations of compact groups. This general result is then used to derive improved sample complexity bounds for recovering band-limited functions on the sphere from their noisy copies, each acted upon by a random element of SO(3).
Abstract:A single-particle cryo-electron microscopy (cryo-EM) measurement, called a micrograph, consists of multiple two-dimensional tomographic projections of a three-dimensional molecular structure at unknown locations, taken under unknown viewing directions. All existing cryo-EM algorithmic pipelines first locate and extract the projection images, and then reconstruct the structure from the extracted images. However, if the molecular structure is small, the signal-to-noise ratio (SNR) of the data is very low, and thus accurate detection of projection images within the micrograph is challenging. Consequently, all standard techniques fail in low-SNR regimes. To recover molecular structures from measurements of low SNR, and in particular small molecular structures, we devise a stochastic approximate expectation-maximization algorithm to estimate the three-dimensional structure directly from the micrograph, bypassing locating the projection images. We corroborate our computational scheme with numerical experiments, and present successful structure recoveries from simulated noisy measurements.
Abstract:We consider the finite alphabet phase retrieval problem: recovering a signal whose entries lie in a small alphabet of possible values from its Fourier magnitudes. This problem arises in the celebrated technology of X-ray crystallography to determine the atomic structure of biological molecules. Our main result states that for generic values of the alphabet, two signals have the same Fourier magnitudes if and only if several partitions have the same difference sets. Thus, the finite alphabet phase retrieval problem reduces to the combinatorial problem of determining a signal from those difference sets. Notably, this result holds true when one of the letters of the alphabet is zero, namely, for sparse signals with finite alphabet, which is the situation in X-ray crystallography.
Abstract:Different tasks in the computational pipeline of single-particle cryo-electron microscopy (cryo-EM) require enhancing the quality of the highly noisy raw images. To this end, we develop an efficient algorithm for signal enhancement of cryo-EM images. The enhanced images can be used for a variety of downstream tasks, such as 2-D classification, removing uninformative images, constructing ab initio models, generating templates for particle picking, providing a quick assessment of the data set, dimensionality reduction, and symmetry detection. The algorithm includes built-in quality measures to assess its performance and alleviate the risk of model bias. We demonstrate the effectiveness of the proposed algorithm on several experimental data sets. In particular, we show that the quality of the resulting images is high enough to produce ab initio models of approximately 10 Å resolution. The algorithm is accompanied by publicly available, documented, and easy-to-use code.