Abstract:The accurate prediction of physicochemical properties of chemical compounds in mixtures (such as the activity coefficient at infinite dilution $\gamma_{ij}^\infty$) is essential for developing novel and more sustainable chemical processes. In this work, we analyze the performance of previously-proposed GNN-based models for the prediction of $\gamma_{ij}^\infty$, and compare them with several mechanistic models in a series of 9 isothermal studies. Moreover, we develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^\infty$ of molecular systems at different temperatures. Our method combines the simplicity of a Gibbs-Helmholtz-derived expression with a series of graph neural networks that incorporate explicit molecular and intermolecular descriptors for capturing dispersion and hydrogen bonding effects. We have trained this model using experimentally determined $\ln \gamma_{ij}^\infty$ data of 40,219 binary-systems involving 1032 solutes and 866 solvents, overall showing superior performance compared to the popular UNIFAC-Dortmund model. We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications for the model's applicability domain and expected accuracy. In general, GH-GNN is able to produce accurate predictions for extrapolated binary-systems if at least 25 systems with the same combination of solute-solvent chemical classes are contained in the training set and a similarity indicator above 0.35 is also present. This model and its applicability domain recommendations have been made open-source at https://github.com/edgarsmdn/GH-GNN.
Abstract:In recent years, graph neural networks (GNNs) have gained increasing popularity and have shown very promising results for data that are represented by graphs. The majority of GNN architectures are designed based on developing new convolutional and/or pooling layers that better extract the hidden and deeper representations of the graphs to be used for different prediction tasks. The inputs to these layers are mainly the three default descriptors of a graph, node features $(X)$, adjacency matrix $(A)$, and edge features $(W)$ (if available). To provide a more enriched input to the network, we propose a random walk data processing of the graphs based on three selected lengths. Namely, (regular) walks of length 1 and 2, and a fractional walk of length $\gamma \in (0,1)$, in order to capture the different local and global dynamics on the graphs. We also calculate the stationary distribution of each random walk, which is then used as a scaling factor for the initial node features ($X$). This way, for each graph, the network receives multiple adjacency matrices along with their individual weighting for the node features. We test our method on various molecular datasets by passing the processed node features to the network in order to perform several classification and regression tasks. Interestingly, our method, not using edge features which are heavily exploited in molecular graph learning, let a shallow network outperform well known deep GNNs.