Abstract: We present a multimodal search tool that facilitates retrieval of chemical reactions, molecular structures, and associated text from scientific literature. Queries may combine molecular diagrams, textual descriptions, and reaction data, allowing users to connect different representations of chemical information. To support this, the indexing process includes chemical diagram extraction and parsing, extraction of reaction data from text into tabular form, and cross-modal linking of diagrams and their mentions in text. We describe the system's architecture, key functionalities, and retrieval process, along with expert assessments of the system. This demo highlights the workflow and technical components of the search system.
Abstract: A Pyramidal Histogram Of Characters (PHOC) represents the spatial location of symbols as binary vectors. The vectors are composed of levels that split a formula into equal-sized regions of one or more types (e.g., rectangles or ellipses). For each region type, this produces a pyramid of overlapping regions, where the first level contains the entire formula, and the final level contains the finest-grained regions. In this work, we introduce concentric rectangles for regions, and analyze whether subsequent PHOC levels encode redundant information by omitting levels from PHOC configurations. As a baseline, we include a bag-of-words PHOC containing only the first whole-formula level. Finally, using the ARQMath-3 formula retrieval benchmark, we demonstrate that some levels encoded in the original PHOC configurations are redundant, that PHOC models with rectangular regions outperform earlier PHOC models, and that despite their simplicity, PHOC models are surprisingly competitive with the state-of-the-art. PHOC is not math-specific, and might be used for chemical diagrams, charts, or other graphics.
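
To make the level structure concrete, the following is a minimal sketch of a PHOC-style binary vector. It is not the authors' implementation. It assumes a single region type (equal-width vertical strips), symbols given as hypothetical `(label, x_min, x_max)` tuples with positive width, and a fixed symbol vocabulary; the function name `phoc_vector` and its parameters are illustrative only.

```python
from typing import List, Sequence, Tuple

# Hypothetical input format: (label, x_min, x_max) for each symbol in a formula.
Symbol = Tuple[str, float, float]


def phoc_vector(symbols: Sequence[Symbol],
                vocab: Sequence[str],
                levels: Sequence[int]) -> List[int]:
    """Build a PHOC-style binary vector using vertical strips only.

    Level k splits the formula's horizontal extent into k equal-width
    strips. For each strip, one bit per vocabulary symbol records whether
    that symbol's horizontal extent overlaps the strip.
    """
    left = min(x0 for _, x0, _ in symbols)
    right = max(x1 for _, _, x1 in symbols)
    width = (right - left) or 1.0  # guard against a zero-width formula

    bits: List[int] = []
    for level in levels:
        for region in range(level):
            lo = left + width * region / level
            hi = left + width * (region + 1) / level
            # A symbol is present if its extent strictly overlaps the strip.
            present = {label for label, x0, x1 in symbols if x0 < hi and x1 > lo}
            bits.extend(1 if v in present else 0 for v in vocab)
    return bits


# Toy formula "x+y" with unit-width symbols laid out left to right.
formula = [("x", 0.0, 1.0), ("+", 1.0, 2.0), ("y", 2.0, 3.0)]
vocab = ["x", "+", "y"]

full = phoc_vector(formula, vocab, levels=[1, 2])       # levels 1 and 2
bag_of_words = phoc_vector(formula, vocab, levels=[1])  # whole-formula level only
level2_only = phoc_vector(formula, vocab, levels=[2])   # level 1 omitted
```

Under these assumptions, passing `levels=[1]` corresponds to the bag-of-words baseline, and dropping entries from `levels` mimics omitting levels from a PHOC configuration. Other region types, such as ellipses or concentric rectangles, would replace the strip-overlap test with a different region-membership test.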