Abstract:Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine-learned interatomic potentials (MLIPs). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may fail to capture rare but highly informative configurations. In this study, we propose a method that uses uncertainty as the collective variable (CV) to guide the acquisition of chemically relevant data points, focusing on regions of the configuration space where ML model predictions are most uncertain. This approach employs a Gaussian Mixture Model-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. We demonstrate on the alanine dipeptide benchmark system that the approach overcomes energy barriers and explores unseen energy minima, thereby enhancing the data set in an active learning framework.
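To make the idea of a Gaussian Mixture Model-based uncertainty concrete, the sketch below scores a feature vector by its negative log-likelihood under a mixture with diagonal covariances: configurations far from the fitted components get a high score. This is only an illustration under stated assumptions. The abstract does not say which features the mixture is fitted on or how it is fitted, so the function name, the diagonal covariance form, and the toy parameters are all hypothetical.

```python
import math


def gmm_neg_log_likelihood(x, weights, means, variances):
    """Negative log-likelihood of x under a diagonal-covariance Gaussian mixture.

    Assumption: x is some feature vector describing a configuration
    (for example a learned embedding); higher values mean "less familiar".
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log of (mixture weight * diagonal Gaussian density)
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    log_likelihood = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return -log_likelihood


# Hypothetical two-component mixture in a 2-D feature space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

seen = gmm_neg_log_likelihood([0.1, -0.1], weights, means, variances)
unseen = gmm_neg_log_likelihood([10.0, -10.0], weights, means, variances)
print(f"near training data: {seen:.2f}, far from training data: {unseen:.2f}")
```

In a biased simulation, a scalar of this kind would play the role of the CV, so the bias pushes the system toward configurations that score high.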
Abstract:Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Machine learning (ML) force field (FF) models are increasingly replacing ab initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case would be to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for ML MD simulation. We curate representative MD systems, including water, organic molecules, peptides, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, and offer directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate further work.
Abstract:Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
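As a rough illustration of how a barrier translates into a thermal half-life, the sketch below applies the standard Eyring expression for a first-order process, k = (k_B T / h) exp(-ΔG‡ / RT), followed by t_1/2 = ln 2 / k. This is an assumption made for illustration only: the abstract argues for rotation mediated by intersystem crossing, whose rate is not a plain Eyring rate, so this is not the workflow described above. The function names and the example barriers are hypothetical.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J*s
R = 8.314462618       # gas constant, J/(mol*K)


def eyring_rate(barrier_kj_mol, temperature_k=298.15):
    """First-order rate constant (1/s) from an effective free-energy barrier."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_kj_mol * 1000.0 / (R * temperature_k))


def half_life_seconds(barrier_kj_mol, temperature_k=298.15):
    """Half-life of a first-order decay with the Eyring rate above."""
    return math.log(2.0) / eyring_rate(barrier_kj_mol, temperature_k)


# Hypothetical effective barriers in kJ/mol, chosen only to show the trend.
for barrier in (90.0, 100.0, 110.0):
    hours = half_life_seconds(barrier) / 3600.0
    print(f"barrier {barrier:.0f} kJ/mol -> half-life {hours:.3g} h")
```

The exponential dependence is the point: at room temperature, a change of a few kJ/mol in the effective barrier shifts the half-life by roughly an order of magnitude.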
Abstract:Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches proceed in two stages: they first predict the distances between atoms and then generate a 3D structure by solving a distance geometry optimization problem. However, the distances predicted by such two-stage approaches may not consistently preserve the geometry of local atomic neighborhoods, making the generated structures unsatisfactory. In this paper, we propose an end-to-end solution for molecular conformation prediction called ConfVAE, based on the conditional variational autoencoder framework. Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program. Extensive experiments on several benchmark data sets demonstrate the effectiveness of our proposed approach over existing state-of-the-art approaches. Code is available at https://github.com/MinkaiXu/ConfVAE-ICML21.
Abstract:Machine learning has been widely adopted to accelerate the screening of materials. Most existing studies implicitly assume that the training data are generated through a deterministic, unbiased process, but this assumption might not hold for the simulation of some complex materials. In this work, we aim to screen amorphous polymer electrolytes, which are promising candidates for next-generation lithium-ion battery technology but extremely expensive to simulate due to their structural complexity. We demonstrate that a multi-task graph neural network can learn from a large amount of noisy, biased data and a small amount of unbiased data, and reduce both random and systematic errors in predicting the transport properties of polymer electrolytes. This observation allows us to achieve accurate predictions of the properties of complex materials by learning to reduce errors in the training data, instead of running repetitive, expensive simulations, as is conventionally done to reduce simulation errors. With this approach, we screen a space of 6247 polymer electrolytes, orders of magnitude larger than in previous computational studies. We also find good extrapolation performance to the top polymers from a larger space of 53362 polymers and to 31 experimentally realized polymers. The strategy employed in this work may be applicable to a broad class of material discovery problems that involve the simulation of complex, amorphous materials.
Abstract:Virtual screening can accelerate drug discovery by identifying top candidates for experimental testing. Machine learning is a powerful method for screening, as it can learn complex structure-property relationships from experimental data and make rapid predictions over virtual libraries. Although molecules are inherently three-dimensional and their biological action typically occurs through supramolecular recognition, most machine learning approaches use a 2D graph representation of molecules as input; few use 3D information, and none take into account the ensemble of conformers accessible to a species. Here we investigate whether the 3D information of multiple conformers can improve molecular property prediction. We introduce a number of new 3D-based models that can take multiple conformers as input to predict drug activity, and find that they learn interpretable weights for each conformer. The new architectures perform significantly better than 2D models, but their performance is just as strong with a single conformer as with many. From this analysis we identify the best 3D architecture and examine its predictions on species without experimental data.
Abstract:Machine learning outperforms traditional approaches in many molecular design tasks. Although molecules are often thought of as 2D graphs, they in fact consist of an ensemble of inter-converting 3D structures called conformers. Molecular properties arise from the contribution of many conformers, and in the case of a drug binding a target, may be due mainly to a few distinct members. Molecular representations in machine learning are typically based on either one single 3D conformer or on a 2D graph that strips geometrical information. No reference datasets exist that connect these graph and point cloud ensemble representations. Here, we use first-principles simulations to annotate over 400,000 molecules with the ensemble of geometries they span. The Geometrical Embedding Of Molecules (GEOM) dataset contains over 33 million molecular conformers labeled with their relative energies and statistical probabilities at room temperature. This dataset will assist benchmarking and transfer learning in two classes of tasks: inferring 3D properties from 2D molecular graphs, and developing generative models to sample 3D conformations.
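The statistical probabilities mentioned above can be related to relative energies through Boltzmann weighting, p_i = exp(-E_i / RT) / sum_j exp(-E_j / RT). The sketch below shows that calculation for a handful of conformers. It is a minimal illustration that assumes energies in kcal/mol and ignores conformer degeneracy. The function name and the example energies are hypothetical and are not taken from the GEOM dataset.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol*K)


def boltzmann_weights(rel_energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann populations from relative conformer energies."""
    kt = R_KCAL * temperature_k
    # Shift by the minimum energy so the largest exponent is exactly zero.
    e_min = min(rel_energies_kcal)
    factors = [math.exp(-(e - e_min) / kt) for e in rel_energies_kcal]
    z = sum(factors)
    return [f / z for f in factors]


# Hypothetical relative energies (kcal/mol) for four conformers of one molecule.
energies = [0.0, 0.5, 1.2, 3.0]
populations = boltzmann_weights(energies)
for e, p in zip(energies, populations):
    print(f"E = {e:.1f} kcal/mol -> population {p:.3f}")
```

Because RT is about 0.59 kcal/mol at room temperature, conformers a few kcal/mol above the minimum contribute very little to the ensemble.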