Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lawrence Zhao

$\textit{L+M-24}$: Building a Dataset for Language + Molecules @ ACL 2024

Feb 22, 2024

Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji

Abstract:Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

* The dataset, finetuned baselines, and evaluation code are released publicly at https://github.com/language-plus-molecules/LPM-24-Dataset through https://huggingface.co/language-plus-molecules

Via

Access Paper or Ask Questions