Abstract:High-quality labeled datasets are essential for deep learning. Traditional manual annotation methods are not only costly and inefficient but also pose challenges in specialized domains where expert knowledge is needed. Self-supervised methods, despite leveraging unlabeled data for feature extraction, still require hundreds or thousands of labeled instances to guide the model for effective specialized image classification. Current unsupervised learning methods offer automatic classification without prior annotation but often compromise on accuracy. As a result, efficiently procuring high-quality labeled datasets remains a pressing challenge for specialized domain images devoid of annotated data. Addressing this, an unsupervised classification method with three key ideas is introduced: 1) dual-step feature dimensionality reduction using a pre-trained model and manifold learning, 2) a voting mechanism from multiple clustering algorithms, and 3) post-hoc instead of prior manual annotation. This approach outperforms supervised methods in classification accuracy, as demonstrated with fungal image data, achieving 94.1% and 96.7% on public and private datasets respectively. The proposed unsupervised classification method reduces dependency on pre-annotated datasets, enabling a closed-loop for data classification. The simplicity and ease of use of this method will also bring convenience to researchers in various fields in building datasets, promoting AI applications for images in specialized domains.
Abstract:Advances on cryo-electron imaging technologies have led to a rapidly increasing number of density maps. Alignment and comparison of density maps play a crucial role in interpreting structural information, such as conformational heterogeneity analysis using global alignment and atomic model assembly through local alignment. Here, we propose a fast and accurate global and local cryo-electron microscopy density map alignment method CryoAlign, which leverages local density feature descriptors to capture spatial structure similarities. CryoAlign is the first feature-based EM map alignment tool, in which the employment of feature-based architecture enables the rapid establishment of point pair correspondences and robust estimation of alignment parameters. Extensive experimental evaluations demonstrate the superiority of CryoAlign over the existing methods in both alignment accuracy and speed.
Abstract:In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees in the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships, and robustness to irrelevant features and missing values in auxiliary information. The parameters in MAFI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large-scale datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analysis. Our approach is implemented in the R package mfair available at https://github.com/YangLabHKUST/mfair.
Abstract:Cryo-electron tomography (cryo-ET) is an imaging technique that allows three-dimensional visualization of macro-molecular assemblies under near-native conditions. Cryo-ET comes with a number of challenges, mainly low signal-to-noise and inability to obtain images from all angles. Computational methods are key to analyze cryo-electron tomograms. To promote innovation in computational methods, we generate a novel simulated dataset to benchmark different methods of localization and classification of biological macromolecules in tomograms. Our publicly available dataset contains ten tomographic reconstructions of simulated cell-like volumes. Each volume contains twelve different types of complexes, varying in size, function and structure. In this paper, we have evaluated seven different methods of finding and classifying proteins. Seven research groups present results obtained with learning-based methods and trained on the simulated dataset, as well as a baseline template matching (TM), a traditional method widely used in cryo-ET research. We show that learning-based approaches can achieve notably better localization and classification performance than TM. We also experimentally confirm that there is a negative relationship between particle size and performance for all methods.
Abstract:Knowledge transfer from a source domain to a different but semantically related target domain has long been an important topic in the context of unsupervised domain adaptation (UDA). A key challenge in this field is establishing a metric that can exactly measure the data distribution discrepancy between two homogeneous domains and adopt it in distribution alignment, especially in the matching of feature representations in the hidden activation space. Existing distribution matching approaches can be interpreted as failing to either explicitly orderwise align higher-order moments or satisfy the prerequisite of certain assumptions in practical uses. We propose a novel moment-based probability distribution metric termed dimensional weighted orderwise moment discrepancy (DWMD) for feature representation matching in the UDA scenario. Our metric function takes advantage of a series for high-order moment alignment, and we theoretically prove that our DWMD metric function is error-free, which means that it can strictly reflect the distribution differences between domains and is valid without any feature distribution assumption. In addition, since the discrepancies between probability distributions in each feature dimension are different, dimensional weighting is considered in our function. We further calculate the error bound of the empirical estimate of the DWMD metric in practical applications. Comprehensive experiments on benchmark datasets illustrate that our method yields state-of-the-art distribution metrics.
Abstract:Electron tomography (ET) allows high-resolution reconstructions of macromolecular complexes at nearnative state. Cellular structures segmentation in the reconstruction data from electron tomographic images is often required for analyzing and visualizing biological structures, making it a powerful tool for quantitative descriptions of whole cell structures and understanding biological functions. However, these cellular structures are rather difficult to automatically separate or quantify from view owing to complex molecular environment and the limitations of reconstruction data of ET. In this paper, we propose a single end-to-end deep fully-convolutional semantic segmentation network dubbed SegET with rich contextual features which fully exploitsthe multi-scale and multi-level contextual information and reduces the loss of details of cellular structures in ET images. We trained and evaluated our network on the electron tomogram of the CTL Immunological Synapse from Cell Image library. Our results demonstrate that SegET can automatically segment accurately and outperform all other baseline methods on each individual structure in our ET dataset.
Abstract:Super-resolution fluorescence microscopy, with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures. Here we propose a novel deep learning guided Bayesian inference approach, DLBI, for the time-series analysis of high-density fluorescent images. Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution of the fluorophores that are consistent with the observed time-series fluorescent images by exploring local features and correlation along time-axis, and statistical inference further refines the ultrastructure extracted by deep learning and endues physical meaning to the final image. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while our method is more than two orders of magnitude faster. The main program is available at https://github.com/lykaust15/DLBI