Abstract: Deep learning methods have greatly advanced protein folding research, with AlphaFold2 (AF2) achieving exceptional performance and atomic-level precision. Because co-evolution is integral to protein structure prediction, AF2's accuracy depends strongly on the depth of the multiple sequence alignment (MSA), which is built by searching a large protein database for sequences similar to the query. However, not all protein sequences have abundant homologous families, and AF2's performance can degrade on such queries, at times failing to produce meaningful results. To address this, we introduce MSA-Augmenter, a novel generative language model that leverages protein-specific attention mechanisms and large-scale MSAs to generate useful, novel protein sequences not currently found in databases. These sequences supplement shallow MSAs and improve the accuracy of structural property predictions. Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain the co-evolutionary information present in low-quality MSAs, thereby improving protein structure prediction quality beyond the already strong AF2 baseline.
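As a rough illustration of what "MSA depth" means in practice, the sketch below counts the sequences in an alignment stored as FASTA/A3M-style text (one `>` header line per sequence) and flags the alignment as shallow when the count falls below a cutoff. The function names `msa_depth` and `is_shallow` and the default cutoff of 30 sequences are assumptions made for this example only; they are not taken from the MSA-Augmenter work or from AF2.

```python
def msa_depth(alignment_text: str) -> int:
    """Count sequences in FASTA/A3M-style text (one '>' header per sequence)."""
    return sum(1 for line in alignment_text.splitlines() if line.startswith(">"))


def is_shallow(alignment_text: str, min_depth: int = 30) -> bool:
    """Return True when the alignment has fewer than min_depth sequences.

    min_depth is a hypothetical cutoff chosen only for illustration.
    """
    return msa_depth(alignment_text) < min_depth


# Toy alignment: the query plus two homologs (made-up sequences).
example = """>query
MKTAYIAKQR
>homolog_1
MKTAYLAKQR
>homolog_2
MKSAYIAKQK
"""

depth = msa_depth(example)
shallow = is_shallow(example)
```

An alignment flagged this way is the kind of input for which generated sequences would be used to supplement the existing homologs.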
Abstract: RNA structure determination and prediction can promote RNA-targeted drug development and the design of engineerable synthetic elements. However, owing to the intrinsic structural flexibility of RNAs, all three mainstream structure determination methods (X-ray crystallography, NMR, and cryo-EM) encounter challenges when resolving RNA structures, which leaves resolved RNA structures scarce. Computational prediction approaches have emerged as a complement to the experimental techniques. However, none of the \textit{de novo} approaches is based on deep learning, because too few structures are available. Instead, most of them rely on time-consuming sampling-based strategies, and their performance appears to have plateaued. In this work, we develop the first end-to-end deep learning approach, E2Efold-3D, to accurately perform \textit{de novo} RNA structure prediction. We propose several novel components to overcome the data scarcity, including a fully differentiable end-to-end pipeline, secondary structure-assisted self-distillation, and a parameter-efficient backbone formulation. These designs are validated on the independent, non-overlapping RNA puzzle test dataset and reach an average root-mean-square deviation below 4 \AA{}, demonstrating superior performance compared to state-of-the-art approaches. Interestingly, E2Efold-3D also achieves promising results when predicting RNA complex structures, a feat that none of the previous systems could accomplish. When E2Efold-3D is coupled with experimental techniques, the RNA structure prediction field can be greatly advanced.
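To make the reported metric concrete, the following minimal sketch computes the root-mean-square deviation (RMSD) between two sets of matched atomic coordinates, in whatever length unit the coordinates use (typically \AA{}). It assumes the two structures are already superimposed and that atoms are listed in the same order; the optimal superposition step that usually precedes an RMSD calculation is omitted here. The function name `rmsd` and the toy coordinates are illustrative and are not taken from the E2Efold-3D work.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: every atom of the prediction is displaced by 2 units along x.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
deviation = rmsd(reference, predicted)
```

Under this definition, an average RMSD below 4 \AA{} means that, after superposition, the root of the mean squared distance between corresponding predicted and reference atoms is under 4 \AA{}.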