Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Laura Mandell

Font Identification in Historical Documents Using Active Learning

Jan 27, 2016

Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Richard Furuta, Laura Mandell

Figure 1 for Font Identification in Historical Documents Using Active Learning

Figure 2 for Font Identification in Historical Documents Using Active Learning

Figure 3 for Font Identification in Historical Documents Using Active Learning

Figure 4 for Font Identification in Historical Documents Using Active Learning

Abstract:Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. Towards this end, we present an active-learning strategy that can significantly reduce the number of labeled samples needed to train a font classifier. Our approach extracts image-based features that exploit geometric differences between fonts at the word level, and combines them into a bag-of-word representation for each page in a document. We evaluate six sampling strategies based on uncertainty, dissimilarity and diversity criteria, and test them on a database containing over 3,000 historical documents with Blackletter, Roman and Mixed fonts. Our results show that a combination of uncertainty and diversity achieves the highest predictive accuracy (89% of test cases correctly classified) while requiring only a small fraction of the data (17%) to be labeled. We discuss the implications of this result for mass digitization projects of historical documents.

Via

Access Paper or Ask Questions