Matthew Bierbaum

On the Use of ArXiv as a Dataset

Apr 30, 2019

Colin B. Clement, Matthew Bierbaum, Kevin P. O'Keeffe, Alexander A. Alemi

Abstract: The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure (created by citation, affiliation, and co-authorship), make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.

* 7 pages, 3 tables, 2 figures, ICLR 2019 workshop RLGM submission
