Johannes Baiter

Towards Robust Named Entity Recognition for Historic German

Jun 18, 2019

Stefan Schweter, Johannes Baiter

Abstract: Recent advances in language modeling using deep neural networks have shown that these models learn representations that vary with network depth, from morphology to semantic relationships such as co-reference. We apply pre-trained language models to low-resource named entity recognition for Historic German. We show in a series of experiments that character-based pre-trained language models do not run into trouble when faced with low-resource datasets. Our pre-trained character-based language models improve upon classical CRF-based methods and previous work on Bi-LSTMs, boosting F1 score by up to 6%. Our pre-trained language and NER models are publicly available at https://github.com/stefan-it/historic-ner.

* 8 pages, 5 figures, accepted at the 4th Workshop on Representation Learning for NLP (RepL4NLP), held in conjunction with ACL 2019

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

Sep 14, 2018

Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter

Abstract: In this paper, we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide period of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable for training state-of-the-art recognition models for OCR software that employs recurrent neural networks in an LSTM architecture, such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset, yielding character accuracy rates between 95% (early printings) and 98% (19th-century Fraktur printings) on unseen test cases, as well as a Perl script to harmonize GT produced by different transcription rules. Finally, we give hints on how to construct GT for OCR purposes, which has requirements that may differ from those of linguistically motivated transcriptions.

* Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition
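
The character accuracy rates quoted in the abstract can be illustrated with a small sketch. The abstract does not say how the rate is computed, so the code below assumes a common definition: one minus the total edit distance between the ground-truth lines and the OCR output, divided by the total number of ground-truth characters. The function names `levenshtein` and `character_accuracy` are hypothetical and are not taken from the GT4HistOCR tooling.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(pairs):
    """Character accuracy over (ground_truth, ocr_output) line pairs.

    Assumed definition: 1 - (sum of edit distances / sum of ground-truth lengths).
    """
    errors = 0
    total = 0
    for ground_truth, ocr_output in pairs:
        errors += levenshtein(ground_truth, ocr_output)
        total += len(ground_truth)
    if total == 0:
        raise ValueError("no ground-truth characters to evaluate")
    return 1.0 - errors / total


if __name__ == "__main__":
    # Illustrative pair only: a long s (ſ) misread as a round s counts as an error.
    print(character_accuracy([("Waſſer", "Wasser")]))
```

Under this definition, a transcription rule that maps ſ to s in the ground truth would change the measured rate, which is one reason harmonizing GT produced under different transcription rules matters before comparing models.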
