David Ryan Koes

University of Pittsburgh

GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation

Apr 30, 2025

Filipp Nikitin, Ian Dunn, David Ryan Koes, Olexandr Isayev

Abstract: Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at https://github.com/isayevlab/geom-drugs-3dgen-evaluation.
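As a rough illustration of what a charge-aware valency check might look like, the sketch below looks up allowed valencies by (element, formal charge) and compares them with the summed bond orders of an atom. The table entries and the name `valency_ok` are assumptions made for this example, limited to a few textbook cases with Kekulized integer bond orders; they are not the tables or code from the linked repository.

```python
# Illustrative valency table keyed by (element, formal charge).
# Entries are textbook cases chosen for this sketch, not the paper's tables.
ALLOWED_VALENCIES = {
    ("H", 0): {1},
    ("C", 0): {4},
    ("N", 0): {3},
    ("N", 1): {4},
    ("O", 0): {2},
    ("O", -1): {1},
}


def valency_ok(element, formal_charge, bond_orders):
    """Return True if the summed integer bond orders match an allowed valency."""
    allowed = ALLOWED_VALENCIES.get((element, formal_charge))
    if allowed is None:
        return False  # unknown (element, charge) pair: treat as invalid
    return sum(bond_orders) in allowed


# A neutral nitrogen with four single bonds is rejected,
# while the same bonding pattern on N+ is accepted.
neutral_n = valency_ok("N", 0, [1, 1, 1, 1])
charged_n = valency_ok("N", 1, [1, 1, 1, 1])
```

Keying the table on formal charge as well as element is the point of the sketch: a lookup by element alone cannot distinguish an ammonium nitrogen from an overbonded neutral one.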

Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation

Apr 30, 2024

Ian Dunn, David Ryan Koes

Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Diffusion models currently achieve state-of-the-art performance for 3D molecule generation. In this work, we explore the use of flow matching, a recently proposed generative modeling framework that generalizes diffusion models, for the task of de novo molecule generation. Flow matching provides flexibility in model design; however, the framework is predicated on the assumption of continuously-valued data. 3D de novo molecule generation requires jointly sampling continuous and categorical variables such as atom position and atom type. We extend the flow matching framework to categorical data by constructing flows that are constrained to exist on a continuous representation of categorical data known as the probability simplex. We call this extension SimplexFlow. We explore the use of SimplexFlow for de novo molecule generation. However, we find that, in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields equivalent or superior performance. As a result of these experiments, we present FlowMol, a flow matching model for 3D de novo molecule generation that achieves improved performance over prior flow matching methods, and we raise important questions about the design of prior distributions for achieving strong performance in flow matching models. Code and trained models for reproducing this work are available at https://github.com/dunni3/FlowMol
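To make the idea of a flow that stays on the probability simplex concrete, here is a minimal sketch of a linear conditional path between a prior sample and a one-hot data point. The uniform Dirichlet prior, the linear interpolant, and all function names are assumptions chosen for illustration; the abstract does not specify SimplexFlow's exact construction, and this is not the FlowMol code.

```python
import random


def sample_simplex(k, rng):
    """Draw a point uniformly from the probability simplex, i.e. Dirichlet(1, ..., 1)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def one_hot(index, k):
    """Represent a category as a vertex of the simplex."""
    return [1.0 if i == index else 0.0 for i in range(k)]


def interpolate(x0, x1, t):
    """Linear conditional path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the linear path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


rng = random.Random(0)
x0 = sample_simplex(4, rng)    # prior sample on the simplex
x1 = one_hot(2, 4)             # data: atom type 2 as a simplex vertex
xt = interpolate(x0, x1, 0.3)  # intermediate point, still on the simplex
v = target_velocity(x0, x1)    # regression target for a learned vector field
```

Because `xt` is a convex combination of two simplex points, its entries stay non-negative and sum to one, and the velocity components sum to zero, so following the velocity never leaves the simplex.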

Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure

Nov 22, 2023

Ian Dunn, David Ryan Koes

Abstract: Diffusion generative models have emerged as a powerful framework for addressing problems in structural biology and structure-based drug design. These models operate directly on 3D molecular structures. Due to the unfavorable scaling of graph neural networks (GNNs) with graph size as well as the relatively slow inference speeds inherent to diffusion models, many existing molecular diffusion models rely on coarse-grained representations of protein structure to make training and inference feasible. However, such coarse-grained representations discard essential information for modeling molecular interactions and impair the quality of generated structures. In this work, we present a novel GNN-based architecture for learning latent representations of molecular structure. When trained end-to-end with a diffusion model for de novo ligand design, our model achieves comparable performance to one with an all-atom protein representation while exhibiting a 3-fold reduction in inference time.

* This paper appeared as a spotlight paper at the NeurIPS 2023 Generative AI and Biology Workshop

Generating 3D Molecules Conditional on Receptor Binding Sites with Deep Generative Models

Oct 28, 2021

Matthew Ragoza, Tomohide Masuda, David Ryan Koes

Abstract: The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning.

* Main: 12 pages, 7 figures; Supplement: 4 pages, 7 figures

Learning a Continuous Representation of 3D Molecular Structures with Deep Generative Models

Oct 20, 2020

Matthew Ragoza, Tomohide Masuda, David Ryan Koes

Abstract: Machine learning methods in drug discovery have primarily focused on virtual screening of molecular libraries using discriminative models. Generative models are an entirely different approach to drug discovery that learn to represent and optimize molecules in a continuous latent space. These methods have already been applied with increasing success to the generation of two-dimensional molecules as SMILES strings and molecular graphs. In this work, we describe deep generative models for three-dimensional molecular structures using atomic density grids and a novel fitting algorithm that converts continuous grids to discrete molecular structures. Our models jointly represent drug-like molecules and their conformations in a latent space that can be explored through interpolation. We are able to sample diverse sets of molecules based on a given input compound and increase the probability of creating a valid, drug-like molecule.

* Added acknowledgements

Generating 3D Molecular Structures Conditional on a Receptor Binding Site with Deep Generative Models

Oct 16, 2020

Tomohide Masuda, Matthew Ragoza, David Ryan Koes

Abstract: Deep generative models have been applied with increasing success to the generation of two-dimensional molecules as SMILES strings and molecular graphs. In this work, we describe for the first time a deep generative model that can generate 3D molecular structures conditioned on a three-dimensional (3D) binding pocket. Using convolutional neural networks, we encode atomic density grids into separate receptor and ligand latent spaces. The ligand latent space is variational to support sampling of new molecules. A decoder network generates atomic densities of novel ligands conditioned on the receptor. Discrete atoms are then fit to these continuous densities to create molecular structures. We show that valid and unique molecules can be readily sampled from the variational latent space defined by a reference 'seed' structure and generated structures have reasonable interactions with the binding site. As structures are sampled farther in latent space from the seed structure, the novelty of the generated structures increases, but the predicted binding affinity decreases. Overall, we demonstrate the feasibility of conditional 3D molecular structure generation and provide a starting point for methods that also explicitly optimize for desired molecular properties, such as high binding affinity.

SidechainNet: An All-Atom Protein Structure Dataset for Machine Learning

Oct 16, 2020

Jonathan E. King, David Ryan Koes

Abstract: Despite recent advancements in deep learning methods for protein structure prediction and representation, little focus has been directed at the simultaneous inclusion and prediction of protein backbone and sidechain structure information. We present SidechainNet, a new dataset that directly extends the ProteinNet dataset. SidechainNet includes angle and atomic coordinate information capable of describing all heavy atoms of each protein structure. In this paper, we first provide background information on the availability of protein structure data and the significance of ProteinNet. Thereafter, we argue for the potentially beneficial inclusion of sidechain information through SidechainNet, describe the process by which we organize SidechainNet, and provide a software package (https://github.com/jonathanking/sidechainnet) for data manipulation and training with machine learning models.

* 8 pages, 2 figures, 1 table, Submitted to Machine Learning for Structural Biology Workshop at the 34th Conference on Neural Information Processing Systems

libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications

Dec 10, 2019

Jocelyn Sunseri, David Ryan Koes

Abstract: There are many ways to represent a molecule as input to a machine learning model and each is associated with loss and retention of certain kinds of information. In the interest of preserving three-dimensional spatial information, including bond angles and torsions, we have developed libmolgrid, a general-purpose library for representing three-dimensional molecules using multidimensional arrays. This library also provides functionality for composing batches of data suited to machine learning workflows, including data augmentation, class balancing, and example stratification according to a regression variable or data subgroup, and it further supports temporal and spatial recurrences over that data to facilitate work with recurrent neural networks, dynamical data, and size-extensive modeling. It was designed for seamless integration with popular deep learning frameworks, including Caffe, PyTorch, and Keras, providing good performance by leveraging graphics processing units (GPUs) for computationally intensive tasks and efficient memory usage through the use of memory views over preallocated buffers. libmolgrid is a free and open source project that is actively supported, serving the growing need in the molecular modeling community for tools that streamline the process of data ingestion, representation construction, and principled machine learning model development.
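The general idea of representing a molecule as a multidimensional array can be sketched in plain Python: each atom contributes a Gaussian density to the points of a cubic grid. This is a conceptual illustration only and does not use or reproduce the libmolgrid API; the function name `grid_density`, the single channel, and the default dimension, resolution, and Gaussian width are all assumptions made for the example.

```python
import math


def grid_density(atoms, dimension=4.0, resolution=1.0, sigma=0.5):
    """Sum one Gaussian per atom onto a cubic grid centered at the origin.

    atoms: list of (x, y, z) coordinates in angstroms.
    Returns a nested list indexed as grid[i][j][k].
    """
    n = int(round(dimension / resolution)) + 1  # grid points per axis
    origin = -dimension / 2.0                   # coordinate of index 0
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for ax, ay, az in atoms:
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    gx = origin + i * resolution
                    gy = origin + j * resolution
                    gz = origin + k * resolution
                    d2 = (gx - ax) ** 2 + (gy - ay) ** 2 + (gz - az) ** 2
                    grid[i][j][k] += math.exp(-d2 / (2.0 * sigma ** 2))
    return grid


# One atom at the origin: the density peaks at the central grid point.
grid = grid_density([(0.0, 0.0, 0.0)])
center = grid[2][2][2]
```

A practical gridder would typically keep a separate channel per atom type and run the triple loop on a GPU; the single-channel loop here only shows how continuous coordinates become array values.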

Visualizing Convolutional Neural Network Protein-Ligand Scoring

Mar 06, 2018

Joshua Hochuli, Alec Helbling, Tamar Skaist, Matthew Ragoza, David Ryan Koes

Abstract: Protein-ligand scoring is an important step in a structure-based drug design pipeline. Selecting a correct binding pose and predicting the binding affinity of a protein-ligand complex enables effective virtual screening. Machine learning techniques can make use of the increasing amounts of structural data that are becoming publicly available. Convolutional neural network (CNN) scoring functions in particular have shown promise in pose selection and affinity prediction for protein-ligand complexes. Neural networks are known for being difficult to interpret. Understanding the decisions of a particular network can help tune parameters and training data to maximize performance. Visualization of neural networks helps decompose complex scoring functions into pictures that are more easily parsed by humans. Here we present three methods for visualizing how individual protein-ligand complexes are interpreted by 3D convolutional neural networks. We also present a visualization of the convolutional filters and their weights. We describe how the intuition provided by these visualizations aids in network design.

Ligand Pose Optimization with Atomic Grid-Based Convolutional Neural Networks

Oct 20, 2017

Matthew Ragoza, Lillian Turner, David Ryan Koes

Abstract: Docking is an important tool in computational drug discovery that aims to predict the binding pose of a ligand to a target protein through a combination of pose scoring and optimization. A scoring function that is differentiable with respect to atom positions can be used for both scoring and gradient-based optimization of poses for docking. Using a differentiable grid-based atomic representation as input, we demonstrate that a scoring function learned by training a convolutional neural network (CNN) to identify binding poses can also be applied to pose optimization. We also show that an iteratively trained CNN that includes poses optimized by the first CNN in its training set performs even better at optimizing randomly initialized poses than either the first CNN scoring function or AutoDock Vina.
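The observation that a score differentiable in atom positions supports gradient-based pose optimization can be illustrated with a toy example. Below, a simple quadratic score with an analytic gradient stands in for the learned CNN scoring function; the score, the reference pose, the step size, and all function names are assumptions for this sketch and have no connection to the paper's actual model.

```python
def score(pose, reference):
    """Toy differentiable score: summed squared distance to a reference pose (lower is better)."""
    return sum((p - q) ** 2 for a, b in zip(pose, reference) for p, q in zip(a, b))


def score_gradient(pose, reference):
    """Analytic gradient of the toy score with respect to every atom coordinate."""
    return [[2.0 * (p - q) for p, q in zip(a, b)] for a, b in zip(pose, reference)]


def optimize_pose(pose, reference, step=0.1, iterations=50):
    """Plain gradient descent on atom positions."""
    for _ in range(iterations):
        grad = score_gradient(pose, reference)
        pose = [
            [p - step * g for p, g in zip(atom, atom_grad)]
            for atom, atom_grad in zip(pose, grad)
        ]
    return pose


reference = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]  # hypothetical low-score pose
start = [[1.0, -0.5, 0.2], [2.0, 1.0, -0.3]]    # perturbed initial pose
optimized = optimize_pose(start, reference)
```

With a learned scoring function the gradient would come from backpropagation through the network and its grid input rather than from a closed-form expression, but the update loop has the same shape.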

* 10 pages
