TU Wien - Computer Vision Lab
Abstract:This paper introduces SAGHOG, a self-supervised pretraining strategy for writer retrieval using HOG features of the binarized input image. Our preprocessing involves the application of the Segment Anything technique to extract handwriting from various datasets, ending up with about 24k documents, followed by training a vision transformer on reconstructing masked patches of the handwriting. SAGHOG is then finetuned by appending NetRVLAD as an encoding layer to the pretrained encoder. Evaluation of our approach on three historical datasets, Historical-WI, HisFrag20, and GRK-Papyri, demonstrates the effectiveness of SAGHOG for writer retrieval. Additionally, we provide ablation studies on our architecture and evaluate un- and supervised finetuning. Notably, on HisFrag20, SAGHOG outperforms related work with a mAP of 57.2 % - a margin of 11.6 % to the current state of the art, showcasing its robustness on challenging data, and is competitive on even small datasets, e.g. GRK-Papyri, where we achieve a Top-1 accuracy of 58.0%.
Abstract:This paper presents an unsupervised approach for writer retrieval based on clustering SIFT descriptors detected at keypoint locations resulting in pseudo-cluster labels. With those cluster labels, a residual network followed by our proposed NetRVLAD, an encoding layer with reduced complexity compared to NetVLAD, is trained on 32x32 patches at keypoint locations. Additionally, we suggest a graph-based reranking algorithm called SGR to exploit similarities of the page embeddings to boost the retrieval performance. Our approach is evaluated on two historical datasets (Historical-WI and HisIR19). We include an evaluation of different backbones and NetRVLAD. It competes with related work on historical datasets without using explicit encodings. We set a new State-of-the-art on both datasets by applying our reranking scheme and show that our approach achieves comparable performance on a modern dataset as well.
Abstract:We present in this paper experiments on Table Recognition in hand-written registry books. We first explain how the problem of row and column detection is modeled, and then compare two Machine Learning approaches (Conditional Random Field and Graph Convolutional Network) for detecting these table elements. Evaluation was conducted on death records provided by the Archive of the Diocese of Passau. Both methods show similar results, a 89 F1 score, a quality which allows for Information Extraction. Software and dataset are open source/data.
Abstract:Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform good on well-established datasets since they either comprise clean data or simple/homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well established text line segmentation evaluation schemes such as the Detection Rate or Recognition Accuracy demand for binarized data that is annotated on a pixel level. Producing ground truth by these means is laborious and not needed to determine a method's quality. In this paper we propose a new evaluation scheme that is based on baselines. The proposed scheme has no need for binarization and it can handle skewed as well as rotated text lines. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts used this evaluation scheme. Finally, we present results achieved by a recently published text line detection algorithm.