Abstract:Identifying drug-target interactions is essential for developing effective therapeutics. Binding affinity quantifies these interactions, and traditional approaches rely on computationally intensive 3D structural data. In contrast, language models can efficiently process sequential data, offering an alternative approach to molecular representation. In the current study, we introduce BAPULM, an innovative sequence-based framework that leverages the chemical latent representations of proteins via ProtT5-XL-U50 and ligands through MolFormer, eliminating reliance on complex 3D configurations. Our approach was validated extensively on benchmark datasets, achieving scoring power (R) values of 0.925 $\pm$ 0.043, 0.914 $\pm$ 0.004, and 0.8132 $\pm$ 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively. These findings indicate the robustness and accuracy of BAPULM across diverse datasets and underscore the potential of sequence-based models in-silico drug discovery, offering a scalable alternative to 3D-centric methods for screening potential ligands.
Abstract:In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non-fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task-specific classifiers - PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: https://github.com/aayush-shah14/PeptideGPT.
Abstract:Adsorption energy is a key reactivity descriptor in catalysis, enabling the efficient screening of potential catalysts. However, determining adsorption energy involves comparing the energies of multiple adsorbate-catalyst configurations, which is computationally demanding due to a large number of possible configurations. Current algorithmic approaches typically enumerate adsorption sites and configurations without leveraging theoretical insights to guide the initial setup. In this work, we present Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently derive system-specific stable adsorption configurations with minimal human intervention. Adsorb-Agent leverages built-in knowledge and emergent reasoning capabilities, significantly reducing the number of initial configurations required while improving accuracy in predicting the minimum adsorption energy. We demonstrate its performance using two example systems, NNH-CuPd3 (111) and NNH-Mo3Pd (111), for the Nitrogen Reduction Reaction (NRR), a sustainable alternative to the Haber-Bosch process. Adsorb-Agent outperforms conventional "heuristic" and "random" algorithms by identifying lower-energy configurations with fewer initial setups, reducing computational costs while enhancing accuracy. This highlights its potential to accelerate catalyst discovery.
Abstract:Solving Partial Differential Equations (PDEs) is ubiquitous in science and engineering. Computational complexity and difficulty in writing numerical solvers has motivated the development of machine learning techniques to generate solutions quickly. Many existing methods are purely data driven, relying solely on numerical solution fields, rather than known system information such as boundary conditions and governing equations. However, the recent rise in popularity of Large Language Models (LLMs) has enabled easy integration of text in multimodal machine learning models. In this work, we use pretrained LLMs to integrate various amounts known system information into PDE learning. Our multimodal approach significantly outperforms our baseline model, FactFormer, in both next-step prediction and autoregressive rollout performance on the 2D Heat, Burgers, Navier-Stokes, and Shallow Water equations. Further analysis shows that pretrained LLMs provide highly structured latent space that is consistent with the amount of system information provided through text.
Abstract:Recent advances in deep learning have inspired numerous works on data-driven solutions to partial differential equation (PDE) problems. These neural PDE solvers can often be much faster than their numerical counterparts; however, each presents its unique limitations and generally balances training cost, numerical accuracy, and ease of applicability to different problem setups. To address these limitations, we introduce several methods to apply latent diffusion models to physics simulation. Firstly, we introduce a mesh autoencoder to compress arbitrarily discretized PDE data, allowing for efficient diffusion training across various physics. Furthermore, we investigate full spatio-temporal solution generation to mitigate autoregressive error accumulation. Lastly, we investigate conditioning on initial physical quantities, as well as conditioning solely on a text prompt to introduce text2PDE generation. We show that language can be a compact, interpretable, and accurate modality for generating physics simulations, paving the way for more usable and accessible PDE solvers. Through experiments on both uniform and structured grids, we show that the proposed approach is competitive with current neural PDE solvers in both accuracy and efficiency, with promising scaling behavior up to $\sim$3 billion parameters. By introducing a scalable, accurate, and usable physics simulator, we hope to bring neural PDE solvers closer to practical use.
Abstract:As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents to operate without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is particularly tested on challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO's design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across various complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution to integrate LLMs with robotic platforms, advancing the state-of-the-art in autonomous robotic task execution. For videos and prompt details, please see our project website: https://sites.google.com/andrew.cmu.edu/plato
Abstract:The design of aerodynamic shapes, such as airfoils, has traditionally required significant computational resources and relied on predefined design parameters, which limit the potential for novel shape synthesis. In this work, we introduce a data-driven methodology for airfoil generation using a diffusion model. Trained on a dataset of preexisting airfoils, our model can generate an arbitrary number of new airfoils from random vectors, which can be conditioned on specific aerodynamic performance metrics such as lift and drag, or geometric criteria. Our results demonstrate that the diffusion model effectively produces airfoil shapes with realistic aerodynamic properties, offering substantial improvements in efficiency, flexibility, and the potential for discovering innovative airfoil designs. This approach significantly expands the design space, facilitating the synthesis of high-performance aerodynamic shapes that transcend the limitations of traditional methods.
Abstract:Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.
Abstract:The success of diffusion probabilistic models in generative tasks, such as text-to-image generation, has motivated the exploration of their application to regression problems commonly encountered in scientific computing and various other domains. In this context, the use of diffusion regression models for ensemble prediction is becoming a practice with increasing popularity. Under such background, we conducted a study to quantitatively evaluate the effectiveness of ensemble methods on solving different regression problems using diffusion models. We consider the ensemble prediction of a diffusion model as a means for zero-shot uncertainty quantification, since the diffusion models in our study are not trained with a loss function containing any uncertainty estimation. Through extensive experiments on 1D and 2D data, we demonstrate that ensemble methods consistently improve model prediction accuracy across various regression tasks. Notably, we observed a larger accuracy gain in auto-regressive prediction compared with point-wise prediction, and that enhancements take place in both the mean-square error and the physics-informed loss. Additionally, we reveal a statistical correlation between ensemble prediction error and ensemble variance, offering insights into balancing computational complexity with prediction accuracy and monitoring prediction confidence in practical applications where the ground truth is unknown. Our study provides a comprehensive view of the utility of diffusion ensembles, serving as a useful reference for practitioners employing diffusion models in regression problem-solving.
Abstract:There has recently been increasing attention towards developing foundational neural Partial Differential Equation (PDE) solvers and neural operators through large-scale pretraining. However, unlike vision and language models that make use of abundant and inexpensive (unlabeled) data for pretraining, these neural solvers usually rely on simulated PDE data, which can be costly to obtain, especially for high-dimensional PDEs. In this work, we aim to Pretrain neural PDE solvers on Lower Dimensional PDEs (PreLowD) where data collection is the least expensive. We evaluated the effectiveness of this pretraining strategy in similar PDEs in higher dimensions. We use the Factorized Fourier Neural Operator (FFNO) due to having the necessary flexibility to be applied to PDE data of arbitrary spatial dimensions and reuse trained parameters in lower dimensions. In addition, our work sheds light on the effect of the fine-tuning configuration to make the most of this pretraining strategy.