Mark Neumann

Orb: A Fast, Scalable Neural Network Potential

Oct 29, 2024

Mark Neumann, James Gin, Benjamin Rhodes, Steven Bennett, Zhiyi Li, Hitarth Choubisa, Arthur Hussey, Jonathan Godwin

Abstract: We introduce Orb, a family of universal interatomic potentials for atomistic modelling of materials. Orb models are 3-6 times faster than existing universal potentials, stable under simulation for a range of out-of-distribution materials and, upon release, represented a 31% reduction in error over other methods on the Matbench Discovery benchmark. We explore several aspects of foundation model development for materials, with a focus on diffusion pretraining. We evaluate Orb as a model for geometry optimization, Monte Carlo and molecular dynamics simulations.

PAWLS: PDF Annotation With Labels and Structure

Jan 25, 2021

Mark Neumann, Zejiang Shen, Sam Skjonsberg

Abstract: Adobe's Portable Document Format (PDF) is a popular way of distributing view-only documents with a rich visual markup. This presents a challenge to NLP practitioners who wish to use the information contained within PDF documents for training models or data analysis, because annotating these documents is difficult. In this paper, we present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format. PAWLS is particularly suited for mixed-mode annotation and scenarios in which annotators require extended context to annotate accurately. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models. A read-only PAWLS server is available at https://pawls.apps.allenai.org/ and the source code is available at https://github.com/allenai/pawls.

PySBD: Pragmatic Sentence Boundary Disambiguation

Oct 19, 2020

Nipun Sadvilkar, Mark Neumann

Abstract: In this paper, we present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text are unknown. In our work, we adapt the Golden Rule Set (a language-specific set of sentence boundary exemplars), originally implemented as a Ruby gem, pragmatic_segmenter, which we ported to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open-source Python tool.

* 'PySBD: Pragmatic Sentence Boundary Disambiguation' is a short paper (5 pages with references) accepted to the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020, on 19 Nov 2020
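
To make the idea of rule-based sentence boundary disambiguation concrete, here is a minimal sketch using only the Python standard library. It is not PySBD's implementation or API: the abbreviation list, the `split_sentences` function, and the single "period after a known abbreviation is not a boundary" rule are all hypothetical simplifications, whereas the Golden Rule Set covers many more cases per language.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only; a real rule set
# is far larger and language-specific.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}

# Candidate boundary: sentence-final punctuation followed by whitespace.
_CANDIDATE = re.compile(r"([.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split at '.', '!' or '?' unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        # Last whitespace-delimited token before the punctuation mark.
        preceding = text[start:match.start(1)].split()
        last_token = preceding[-1].lower() if preceding else ""
        if match.group(1) == "." and last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Dr. Smith met Prof. Jones. They discussed parsing, e.g. rules. Done!"
for sentence in split_sentences(example):
    print(sentence)
```

A naive split on every period would break after "Dr." and "Prof."; the abbreviation rule is the kind of exemplar-driven exception that a rule-based segmenter encodes.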

GORC: A large contextual citation graph of academic papers

Nov 07, 2019

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, Dan S. Weld

Abstract: We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers, across broad domains of science. Each paper is represented with rich paper metadata (title, authors, abstract, etc.), and where available: cleaned full text, section headers, figure and table captions, and parsed bibliography entries. In-line citation mentions in full text are linked to their corresponding bibliography entries, which are in turn linked to in-corpus cited papers, forming the edges of a contextual citation graph. To our knowledge, this is the largest publicly available contextual citation graph; the full text alone is the largest parsed academic text corpus publicly available. We demonstrate the ability to identify similar papers using these citation contexts and propose several applications for language modeling and citation-related tasks.

* 12 pages, 2 figures, 5 appendices
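
The linking chain described in the abstract (in-line mention, then bibliography entry, then in-corpus paper) can be sketched as a small data structure. This is an illustrative assumption, not the GORC release schema: the class names, field names, and the `citation_edges` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BibEntry:
    bib_id: str
    title: str
    # ID of the cited paper if it was resolved to a paper in the corpus.
    cited_paper_id: Optional[str] = None


@dataclass
class CitationMention:
    bib_id: str   # which bibliography entry the in-line mention points to
    context: str  # text surrounding the in-line citation


@dataclass
class Paper:
    paper_id: str
    title: str
    bibliography: dict[str, BibEntry] = field(default_factory=dict)
    mentions: list[CitationMention] = field(default_factory=list)


def citation_edges(papers: dict[str, Paper]) -> list[tuple[str, str, str]]:
    """Return (citing_id, cited_id, context) for mentions that resolve in-corpus."""
    edges = []
    for paper in papers.values():
        for mention in paper.mentions:
            entry = paper.bibliography.get(mention.bib_id)
            if entry is not None and entry.cited_paper_id in papers:
                edges.append((paper.paper_id, entry.cited_paper_id, mention.context))
    return edges
```

Mentions whose bibliography entry is missing, or whose cited paper is outside the corpus, simply produce no edge, which mirrors the "where available" qualification in the abstract.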

Knowledge Enhanced Contextual Word Representations

Sep 09, 2019

Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, Noah A. Smith

Abstract: Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task and downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert's runtime is comparable to BERT's and it scales to large KBs.

* EMNLP 2019

Grammar-based Neural Text-to-SQL Generation

May 30, 2019

Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, Matt Gardner

Abstract: The sequence-to-sequence paradigm employed by neural text-to-SQL models typically performs token-level decoding and does not consider generating SQL hierarchically from a grammar. Grammar-based decoding has shown significant improvements for other semantic parsing tasks, but SQL and other general programming languages have complexities not present in logical formalisms that make writing hierarchical grammars difficult. We introduce techniques to handle these complexities, showing how to construct a schema-dependent grammar with minimal over-generation. We analyze these techniques on ATIS and Spider, two challenging text-to-SQL datasets, demonstrating that they yield 14-18% relative reductions in error.
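
As a rough illustration of what "schema-dependent grammar" and hierarchical decoding mean, the sketch below builds a toy grammar whose table and column productions are generated from a database schema, then derives a query from a sequence of production choices instead of emitting tokens one by one. The grammar shape, the nonterminal names, and the `build_grammar` and `derive` helpers are hypothetical and far simpler than the paper's grammar. In particular, this toy grammar still over-generates (it can pair a column with the wrong table), which is the kind of problem the paper's techniques aim to reduce.

```python
def build_grammar(schema: dict[str, list[str]]) -> dict[str, list[list[str]]]:
    """Build a toy grammar; <table> and <column> rules come from the schema."""
    return {
        "<query>": [["SELECT", "<column>", "FROM", "<table>"]],
        "<table>": [[table] for table in schema],
        "<column>": [
            [f"{table}.{column}"]
            for table, columns in schema.items()
            for column in columns
        ],
    }


def derive(grammar: dict[str, list[list[str]]], choices: list[int],
           symbol: str = "<query>") -> str:
    """Expand `symbol` top-down, consuming one production index per nonterminal."""
    remaining = iter(choices)

    def expand(sym: str) -> list[str]:
        if sym not in grammar:
            return [sym]  # terminal token, emitted as-is
        production = grammar[sym][next(remaining)]
        tokens: list[str] = []
        for child in production:
            tokens.extend(expand(child))
        return tokens

    return " ".join(expand(symbol))


schema = {"flight": ["id", "airline"], "city": ["name"]}
grammar = build_grammar(schema)
print(derive(grammar, [0, 1, 0]))
```

In a neural decoder, the model would score the valid productions for the current nonterminal at each step; here the list of integer choices stands in for those decisions.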

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Feb 21, 2019

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar

Abstract: Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new tool for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/

Dissecting Contextual Word Embeddings: Architecture and Representation

Sep 27, 2018

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, Wen-tau Yih

Abstract: Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphological based at the word embedding layer through local syntax based in the lower contextual layers to longer range semantics such as coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

* EMNLP 2018

Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context

Jun 20, 2018

Lucy Lu Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm, Waleed Ammar

Abstract: Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative (OAEI) largebio SNOMED-NCI subtask, comparable with the entity-level matchers in a SOTA system.

* ACL 2018 BioNLP workshop

AllenNLP: A Deep Semantic Natural Language Processing Platform

May 31, 2018

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer

Abstract: This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding. AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily. It is built on top of PyTorch, allowing for dynamic computation graphs, and provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy. It also includes reference implementations of high quality approaches for both core semantic problems (e.g. semantic role labeling (Palmer et al., 2005)) and language understanding applications (e.g. machine comprehension (Rajpurkar et al., 2016)). AllenNLP is an ongoing open-source effort maintained by engineers and researchers at the Allen Institute for Artificial Intelligence.

* Describes the initial version of AllenNLP. Many features and models have been added since the first release. This is the paper to cite if you use AllenNLP in your research. Updated 5/31/2018 with the version accepted to the NLP OSS workshop held at ACL 2018
