Abstract: In this paper, we fine-tuned three pre-trained BERT models on the task of "definition extraction" from mathematical English written in LaTeX. This is presented as a binary classification problem: a sentence either contains a definition of a mathematical term or it does not. We used two original datasets, "Chicago" and "TAC," to fine-tune and test these models. We also tested on WFMALL, a dataset presented by Vanetik and Litvak in 2021, and compared the performance of our models to theirs. We found that a high-performance Sentence-BERT transformer model performed best based on overall accuracy, recall, and precision metrics, achieving results comparable to the earlier models with less computational effort.
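A minimal sketch of the classification setup this abstract describes: Sentence-BERT embeddings feeding a lightweight binary classifier, which is one way such a model can achieve good results with little computational effort. The model name, example sentences, and the logistic-regression head are illustrative assumptions, not the authors' exact pipeline.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data: 1 = the sentence defines a mathematical term, 0 = it does not.
sentences = [
    r"A group is a set $G$ together with an associative binary operation.",
    r"We now apply Lemma 2 to the case $n = 3$.",
]
labels = [1, 0]

# Encode each sentence into a fixed-size vector with a pre-trained Sentence-BERT model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(sentences)

# Train a simple classifier on the frozen embeddings; this is far cheaper
# than fine-tuning the full transformer end to end.
clf = LogisticRegression().fit(embeddings, labels)
print(clf.predict(encoder.encode(["A ring is an abelian group with a second operation."])))
```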
Abstract: MathGloss is a project to create a knowledge graph (KG) for undergraduate mathematics from text, automatically, using modern natural language processing (NLP) tools and resources already available on the web. MathGloss is a linked database of undergraduate concepts in mathematics. So far, it combines five resources: (i) Wikidata, a collaboratively edited, multilingual knowledge graph hosted by the Wikimedia Foundation, (ii) terms covered in mathematics courses at the University of Chicago, (iii) the syllabus of the French undergraduate mathematics curriculum, which includes hyperlinks to the interactive theorem prover Lean 4, (iv) MuLiMa, a multilingual dictionary of mathematics curated by mathematicians, and (v) the nLab, a wiki for category theory also curated by mathematicians. MathGloss's goal is to bring together resources for learning mathematics and to allow every mathematician to tailor their learning to their own preferences. Moreover, by organizing different resources for learning undergraduate mathematics alongside those for learning formal mathematics, we hope to make it easier for mathematicians and experts in formal tools (theorem provers, computer algebra systems, etc.) to "understand" each other and break down some of the barriers to formal math.
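A minimal sketch of how a MathGloss-style linked database might resolve an undergraduate term against Wikidata, the first of the five resources listed above. The SPARQL query and the use of the public Wikidata endpoint are illustrative assumptions; MathGloss's actual linking pipeline is not shown here.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Look up Wikidata items whose English label matches a mathematical term.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label "vector space"@en .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "mathgloss-sketch/0.1"},  # Wikidata asks clients to identify themselves.
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```

Once a term is matched to a Wikidata identifier, the same identifier can serve as the hub linking the other resources (course glossaries, the French curriculum's Lean 4 links, MuLiMa, and the nLab) for that concept.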