Abstract: We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 24 materials properties. Only composition-based input strings are used as the model input, and we fine-tune using only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forests or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using a far more granular representation of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science, and other scientific domains.
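A minimal sketch of the idea of casting property regression as text generation, assuming the Hugging Face transformers and datasets libraries; the model name, prompt template, toy records, and hyperparameters are illustrative placeholders, not the exact setup used in the work above.

```python
# Sketch: fine-tune a causal LM on SMILES -> property text, using only the
# standard generative (next-token) loss. LLaMA 3 weights require access
# approval, so any small causal LM can stand in for experimentation.
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

records = [  # toy stand-ins for QM9 entries: SMILES string -> target property
    {"smiles": "C", "gap": 0.5048},
    {"smiles": "N", "gap": 0.3399},
]

def to_text(r):
    # Serialize each example as a prompt/answer pair; the model learns to
    # emit the numeric value as text after the "Answer:" cue.
    return {"text": f"What is the HOMO-LUMO gap of {r['smiles']}? Answer: {r['gap']:.4f}"}

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    enc = tok(batch["text"], truncation=True, padding="max_length", max_length=64)
    enc["labels"] = enc["input_ids"].copy()  # generative loss: predict the next token
    return enc

ds = Dataset.from_list([to_text(r) for r in records]).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm_regression", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()
```

At inference time, the predicted property would be recovered by parsing the number the model generates after the "Answer:" cue and comparing it against the held-out target.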
Abstract: Knowledge of the domain of applicability of a machine learning model is essential to ensuring accurate and reliable model predictions. In this work, we develop a new approach to assessing model domain and demonstrate that our approach provides accurate and meaningful designations of in-domain versus out-of-domain when applied across multiple model types and material property data sets. Our approach uses kernel density estimation to assess the distance in feature space between a test point and the training data, and we show that this distance provides an effective tool for domain determination. We show that chemical groups considered unrelated based on established chemical knowledge exhibit significant dissimilarities by our measure. We also show that high measures of dissimilarity are associated with poor model performance (i.e., high residual magnitudes) and poor estimates of model uncertainty (i.e., unreliable uncertainty estimates). Automated tools are provided to enable researchers to establish acceptable dissimilarity thresholds and identify whether new predictions from their own machine learning models are in-domain or out-of-domain.
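A minimal sketch of the general KDE-based domain idea, assuming scikit-learn; the synthetic features, Gaussian kernel, bandwidth, and 95th-percentile threshold are illustrative assumptions, not the calibrated dissimilarity measure or automated thresholds described above.

```python
# Sketch: fit a kernel density estimate on the training features and flag test
# points that fall in low-density regions as out-of-domain.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))              # in-domain training features
X_test_in = rng.normal(size=(10, 5))             # test points similar to training
X_test_out = rng.normal(loc=6.0, size=(10, 5))   # dissimilar (out-of-domain) points

scaler = StandardScaler().fit(X_train)
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(scaler.transform(X_train))

def dissimilarity(X):
    # Lower log-density under the training KDE -> greater distance from training data
    return -kde.score_samples(scaler.transform(X))

# Simple threshold choice: 95th percentile of the training set's own dissimilarity
threshold = np.percentile(dissimilarity(X_train), 95)
print("in-domain fraction (similar points):   ",
      np.mean(dissimilarity(X_test_in) <= threshold))
print("in-domain fraction (dissimilar points):",
      np.mean(dissimilarity(X_test_out) <= threshold))
```

In this toy example, points drawn far from the training distribution receive high dissimilarity scores and are flagged as out-of-domain, mirroring the behavior the abstract describes for chemically unrelated groups.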
Abstract: Ensemble models can be used to estimate prediction uncertainties in machine learning models. However, an ensemble of N models is approximately N times more computationally demanding than a single model at inference time. In this work, we explore fitting a single model to predicted ensemble error bar data, which allows us to estimate uncertainties without the need for a full ensemble. Our approach is based on three models: Model A for predictive accuracy, Model $A_{E}$ for traditional ensemble-based error bar prediction, and Model B, which is fit to the error bars produced by Model $A_{E}$ and predicts their values with only one model evaluation. Model B leverages synthetic data augmentation to estimate error bars efficiently. This approach offers a highly flexible method of uncertainty quantification that can approximate the uncertainty estimates of ensemble methods while requiring only a single extra model evaluation beyond Model A during inference. We assess this approach on a set of problems in materials science.
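A minimal sketch of the three-model scheme, assuming scikit-learn; the toy data, the use of a random forest as both Model A and the ensemble $A_{E}$, the tree-spread error bar, and the choice of gradient boosting for Model B are illustrative assumptions, and the synthetic data augmentation step is omitted.

```python
# Sketch: Model A predicts the property, an ensemble (Model A_E) supplies error
# bars from the spread of its members, and Model B is fit to those error bars so
# that inference needs only one extra model evaluation beyond Model A.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=len(X))   # toy materials-like target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Model A / A_E: a random forest doubles as the accuracy model and the ensemble
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

def ensemble_error_bar(X):
    # Standard deviation across ensemble members as the predicted uncertainty
    per_tree = np.stack([t.predict(X) for t in forest.estimators_])
    return per_tree.std(axis=0)

# Model B: a single regressor fit to reproduce the ensemble error bars
model_b = GradientBoostingRegressor(random_state=0).fit(X_tr, ensemble_error_bar(X_tr))

# Inference: one call to Model A for the prediction, one call to Model B for the error bar
y_pred = forest.predict(X_te)        # Model A prediction
sigma_fast = model_b.predict(X_te)   # Model B error bar (no full-ensemble loop needed)
print("corr(Model B, ensemble error bars):",
      np.corrcoef(sigma_fast, ensemble_error_bar(X_te))[0, 1])
```

The design choice here is to trade one additional training stage (fitting Model B to the ensemble's error bars) for a roughly N-fold reduction in inference cost whenever uncertainty estimates are needed.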