University of Pittsburgh
Abstract:Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Diffusion models currently achieve state of the art performance for 3D molecule generation. In this work, we explore the use of flow matching, a recently proposed generative modeling framework that generalizes diffusion models, for the task of de novo molecule generation. Flow matching provides flexibility in model design; however, the framework is predicated on the assumption of continuously-valued data. 3D de novo molecule generation requires jointly sampling continuous and categorical variables such as atom position and atom type. We extend the flow matching framework to categorical data by constructing flows that are constrained to exist on a continuous representation of categorical data known as the probability simplex. We call this extension SimplexFlow. We explore the use of SimplexFlow for de novo molecule generation. However, we find that, in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields equivalent or superior performance. As a result of these experiments, we present FlowMol, a flow matching model for 3D de novo generative model that achieves improved performance over prior flow matching methods, and we raise important questions about the design of prior distributions for achieving strong performance in flow matching models. Code and trained models for reproducing this work are available at https://github.com/dunni3/FlowMol
Abstract:Diffusion generative models have emerged as a powerful framework for addressing problems in structural biology and structure-based drug design. These models operate directly on 3D molecular structures. Due to the unfavorable scaling of graph neural networks (GNNs) with graph size as well as the relatively slow inference speeds inherent to diffusion models, many existing molecular diffusion models rely on coarse-grained representations of protein structure to make training and inference feasible. However, such coarse-grained representations discard essential information for modeling molecular interactions and impair the quality of generated structures. In this work, we present a novel GNN-based architecture for learning latent representations of molecular structure. When trained end-to-end with a diffusion model for de novo ligand design, our model achieves comparable performance to one with an all-atom protein representation while exhibiting a 3-fold reduction in inference time.
Abstract:The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning.
Abstract:Machine learning methods in drug discovery have primarily focused on virtual screening of molecular libraries using discriminative models. Generative models are an entirely different approach to drug discovery that learn to represent and optimize molecules in a continuous latent space. These methods have already been applied with increasing success to the generation of two dimensional molecules as SMILES strings and molecular graphs. In this work, we describe deep generative models for three dimensional molecular structures using atomic density grids and a novel fitting algorithm that converts continuous grids to discrete molecular structures. Our models jointly represent drug-like molecules and their conformations in a latent space that can be explored through interpolation. We are able to sample diverse sets of molecules based on a given input compound and increase the probability of creating a valid, drug-like molecule.
Abstract:Deep generative models have been applied with increasing success to the generation of two dimensional molecules as SMILES strings and molecular graphs. In this work we describe for the first time a deep generative model that can generate 3D molecular structures conditioned on a three-dimensional (3D) binding pocket. Using convolutional neural networks, we encode atomic density grids into separate receptor and ligand latent spaces. The ligand latent space is variational to support sampling of new molecules. A decoder network generates atomic densities of novel ligands conditioned on the receptor. Discrete atoms are then fit to these continuous densities to create molecular structures. We show that valid and unique molecules can be readily sampled from the variational latent space defined by a reference `seed' structure and generated structures have reasonable interactions with the binding site. As structures are sampled farther in latent space from the seed structure, the novelty of the generated structures increases, but the predicted binding affinity decreases. Overall, we demonstrate the feasibility of conditional 3D molecular structure generation and provide a starting point for methods that also explicitly optimize for desired molecular properties, such as high binding affinity.
Abstract:Despite recent advancements in deep learning methods for protein structure prediction and representation, little focus has been directed at the simultaneous inclusion and prediction of protein backbone and sidechain structure information. We present SidechainNet, a new dataset that directly extends the ProteinNet dataset. SidechainNet includes angle and atomic coordinate information capable of describing all heavy atoms of each protein structure. In this paper, we first provide background information on the availability of protein structure data and the significance of ProteinNet. Thereafter, we argue for the potentially beneficial inclusion of sidechain information through SidechainNet, describe the process by which we organize SidechainNet, and provide a software package (https://github.com/jonathanking/sidechainnet) for data manipulation and training with machine learning models.
Abstract:There are many ways to represent a molecule as input to a machine learning model and each is associated with loss and retention of certain kinds of information. In the interest of preserving three-dimensional spatial information, including bond angles and torsions, we have developed libmolgrid, a general-purpose library for representing three-dimensional molecules using multidimensional arrays. This library also provides functionality for composing batches of data suited to machine learning workflows, including data augmentation, class balancing, and example stratification according to a regression variable or data subgroup, and it further supports temporal and spatial recurrences over that data to facilitate work with recurrent neural networks, dynamical data, and size extensive modeling. It was designed for seamless integration with popular deep learning frameworks, including Caffe, PyTorch, and Keras, providing good performance by leveraging graphical processing units (GPUs) for computationally-intensive tasks and efficient memory usage through the use of memory views over preallocated buffers. libmolgrid is a free and open source project that is actively supported, serving the growing need in the molecular modeling community for tools that streamline the process of data ingestion, representation construction, and principled machine learning model development.
Abstract:Protein-ligand scoring is an important step in a structure-based drug design pipeline. Selecting a correct binding pose and predicting the binding affinity of a protein-ligand complex enables effective virtual screening. Machine learning techniques can make use of the increasing amounts of structural data that are becoming publicly available. Convolutional neural network (CNN) scoring functions in particular have shown promise in pose selection and affinity prediction for protein-ligand complexes. Neural networks are known for being difficult to interpret. Understanding the decisions of a particular network can help tune parameters and training data to maximize performance. Visualization of neural networks helps decompose complex scoring functions into pictures that are more easily parsed by humans. Here we present three methods for visualizing how individual protein-ligand complexes are interpreted by 3D convolutional neural networks. We also present a visualization of the convolutional filters and their weights. We describe how the intuition provided by these visualizations aids in network design.
Abstract:Docking is an important tool in computational drug discovery that aims to predict the binding pose of a ligand to a target protein through a combination of pose scoring and optimization. A scoring function that is differentiable with respect to atom positions can be used for both scoring and gradient-based optimization of poses for docking. Using a differentiable grid-based atomic representation as input, we demonstrate that a scoring function learned by training a convolutional neural network (CNN) to identify binding poses can also be applied to pose optimization. We also show that an iteratively-trained CNN that includes poses optimized by the first CNN in its training set performs even better at optimizing randomly initialized poses than either the first CNN scoring function or AutoDock Vina.
Abstract:Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring. We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.