Sungrae Park

KIEval: Evaluation Metric for Document Key Information Extraction

Mar 07, 2025

Minsoo Khang, Sang Chul Jung, Sungrae Park, Teakgyu Hong

Abstract: Document Key Information Extraction (KIE) is a technology that transforms valuable information in document images into structured data, and it has become an essential function in industrial settings. However, current evaluation metrics of this technology do not accurately reflect the critical attributes of its industrial applications. In this paper, we present KIEval, a novel application-centric evaluation metric for Document KIE models. Unlike prior metrics, KIEval assesses Document KIE models not just on the extraction of individual information (entity) but also of the structured information (grouping). Evaluation of structured information provides an assessment of Document KIE models that is more reflective of extracting grouped information from documents in industrial settings. Designed with industrial application in mind, we believe that KIEval can become a standard evaluation metric for developing or applying Document KIE models in practice. The code will be publicly available.
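
The difference between entity-level and grouping-level evaluation can be illustrated with a small sketch. This is not the KIEval metric as defined in the paper; it is a simplified assumption in which an entity is a `(field, value)` pair, a group is a set of entities (for example one receipt line item), and a predicted group counts as correct only on an exact match. The function names and the receipt data are hypothetical.

```python
from collections import Counter


def f1(true_positives, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_pred
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(pred_groups, gold_groups):
    """Score individual (field, value) pairs, ignoring which group they are in."""
    pred = Counter(e for g in pred_groups for e in g)
    gold = Counter(e for g in gold_groups for e in g)
    tp = sum((pred & gold).values())  # multiset intersection
    return f1(tp, sum(pred.values()), sum(gold.values()))


def group_f1(pred_groups, gold_groups):
    """Score whole groups: a group is correct only if all its entities match."""
    pred = Counter(frozenset(g) for g in pred_groups)
    gold = Counter(frozenset(g) for g in gold_groups)
    tp = sum((pred & gold).values())
    return f1(tp, sum(pred.values()), sum(gold.values()))


# Hypothetical receipt: every entity is extracted, but prices are attached
# to the wrong items, so the structured output is unusable downstream.
gold = [
    [("item", "latte"), ("price", "4.50")],
    [("item", "bagel"), ("price", "3.00")],
]
pred = [
    [("item", "latte"), ("price", "3.00")],
    [("item", "bagel"), ("price", "4.50")],
]

entity_score = entity_f1(pred, gold)
group_score = group_f1(pred, gold)
print(entity_score, group_score)
```

In this toy case the entity-level score is perfect while the group-level score is zero, which is the kind of gap a grouping-aware metric is meant to expose.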

SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation

Aug 21, 2023

Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng

Abstract: Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based methods and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing additional computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance.

Domain Generalization by Mutual-Information Regularization with Pre-trained Models

Mar 21, 2022

Junbum Cha, Kyungjae Lee, Sungrae Park, Sanghyuk Chun

Abstract: Domain generalization (DG) aims to learn a model that generalizes to an unseen target domain using only limited source domains. Previous attempts at DG fail to learn domain-invariant representations only from the source domains due to the significant domain shifts between training and test domains. Instead, we re-formulate the DG objective using mutual information with the oracle model, a model generalized to any possible domain. We derive a tractable variational lower bound via approximating the oracle model by a pre-trained model, called Mutual Information Regularization with Oracle (MIRO). Our extensive experiments show that MIRO significantly improves the out-of-distribution performance. Furthermore, our scaling experiments show that the larger the scale of the pre-trained model, the greater the performance improvement of MIRO. Source code is available at https://github.com/kakaobrain/miro.

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

Nov 30, 2021

Byeonghu Na, Yoonsik Kim, Sungrae Park

Abstract: Linguistic knowledge has brought great benefits to scene text recognition by providing semantics to refine character sequences. However, since linguistic knowledge has been applied individually on the output sequence, previous methods have not fully utilized the semantics to understand visual clues for text recognition. This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances. Specifically, MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features. Based on the spatial encoding, visual and semantic features are enhanced by referring to related features in the other modality. Furthermore, MATRN stimulates combining semantic features into visual features by hiding visual clues related to the character in the training phase. Our experiments demonstrate that MATRN achieves state-of-the-art performances on seven benchmarks with large margins, while naive combinations of two modalities show marginal improvements. Further ablative studies prove the effectiveness of our proposed components. Our implementation will be publicly available.

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Sep 10, 2021

Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park

Abstract: Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-training language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basics: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks--(1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples--and demonstrates the superiority of BROS over previous methods. Our code will be open to the public.

* 13 pages, 9 figures

RewriteNet: Realistic Scene Text Image Generation via Editing Text in Real-world Image

Jul 23, 2021

Junyeop Lee, Yoonsik Kim, Seonghyeon Kim, Moonbin Yim, Seung Shin, Gayoung Lee, Sungrae Park

Abstract: Scene text editing (STE), which converts a text in a scene image into the desired text while preserving an original style, is a challenging task due to a complex intervention between text and style. To address this challenge, we propose a novel representational learning-based STE model, referred to as RewriteNet, that employs textual information as well as visual information. We assume that the scene text image can be decomposed into content and style features, where the former represents the text information and the latter represents scene text characteristics such as font, alignment, and background. Under this assumption, we propose a method to separately encode content and style features of the input image by introducing the scene text recognizer that is trained by text information. Then, a text-edited image is generated by combining the style feature from the original image and the content feature from the target text. Unlike previous works that are only able to use synthetic images in the training phase, we also exploit real-world images by proposing a self-supervised training scheme, which bridges the domain gap between synthetic and real data. Our experiments demonstrate that RewriteNet achieves better quantitative and qualitative performance than other comparisons. Moreover, we validate that the use of text information and the self-supervised training scheme improves text switching performance. The implementation and dataset will be publicly available.

SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models

Jul 20, 2021

Moonbin Yim, Yoonsik Kim, Han-Cheol Cho, Sungrae Park

Abstract: For successful scene text recognition (STR) models, synthetic text image generators have alleviated the lack of annotated text images from the real world. Specifically, they generate multiple text images with diverse backgrounds, font styles, and text shapes and enable STR models to learn visual patterns that might not be accessible from manually annotated data. In this paper, we introduce a new synthetic text image generator, SynthTIGER, by analyzing techniques used for text image synthesis and integrating effective ones under a single algorithm. Moreover, we propose two techniques that alleviate the long-tail problem in length and character distributions of training data. In our experiments, SynthTIGER achieves better STR performance than the combination of synthetic datasets, MJSynth (MJ) and SynthText (ST). Our ablation study demonstrates the benefits of using sub-components of SynthTIGER and the guideline on generating synthetic text images for STR models. Our implementation is publicly available at https://github.com/clovaai/synthtiger.

* Accepted at ICDAR 2021, 16 pages, 6 figures

Domain Generalization Needs Stochastic Weight Averaging for Robustness on Domain Shifts

Feb 17, 2021

Junbum Cha, Hancheol Cho, Kyungjae Lee, Seunghyun Park, Yunsung Lee, Sungrae Park

Abstract: Domain generalization aims to learn a generalizable model to unseen target domains from multiple source domains. Various approaches have been proposed to address this problem. However, recent benchmarks show that most of them do not provide significant improvements compared to the simple empirical risk minimization (ERM) in practical cases. In this paper, we analyze how ERM works in terms of domain-invariant feature learning and domain-specific gradient normalization. In addition, we observe that ERM converges to a loss valley shared over multiple training domains and obtain an insight that a center of the valley generalizes better. To estimate the center, we employ stochastic weight averaging (SWA) and provide theoretical analysis describing how SWA supports the generalization bound for an unseen domain. As a result, we achieve state-of-the-art performances on all of the widely used domain generalization benchmarks, namely PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, with large margins. Further analysis reveals how SWA operates on domain generalization tasks.
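
The core of stochastic weight averaging is a running mean over weight snapshots taken along the training trajectory. The sketch below shows only that averaging step on plain Python lists; the snapshot values are made up, `swa_update` is a hypothetical helper name, and the sampling schedule and any other details of the paper's method are not represented.

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Fold one more snapshot into a running mean of `n_averaged` snapshots."""
    return [
        a + (w - a) / (n_averaged + 1)
        for a, w in zip(avg_weights, new_weights)
    ]


# Hypothetical weight snapshots collected at different training steps.
snapshots = [
    [1.0, 4.0],
    [2.0, 2.0],
    [3.0, 0.0],
]

avg = list(snapshots[0])  # the first snapshot initializes the average
for n, weights in enumerate(snapshots[1:], start=1):
    avg = swa_update(avg, weights, n)

print(avg)
```

The incremental form avoids storing every snapshot: only the current average and the count of snapshots seen so far are kept.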

Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching

Feb 05, 2021

Mingi Ji, Byeongho Heo, Sungrae Park

Abstract: Knowledge distillation extracts general knowledge from a pre-trained teacher network and provides guidance to a target student network. Most studies manually tie intermediate features of the teacher and student, and transfer knowledge through pre-defined links. However, manual selection often constructs ineffective links that limit the improvement from the distillation. There has been an attempt to address the problem, but it is still challenging to identify effective links under practical scenarios. In this paper, we introduce an effective and efficient feature distillation method utilizing all the feature levels of the teacher without manually selecting the links. Specifically, our method utilizes an attention-based meta-network that learns relative similarities between features, and applies identified similarities to control distillation intensities of all possible pairs. As a result, our method determines competent links more efficiently than the previous approach and provides better performance on model compression and transfer learning tasks. Further qualitative analyses and ablative studies describe how our method contributes to better distillation. The implementation code is available at github.com/clovaai/attention-feature-distillation.

* 10 pages, AAAI 2021
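
The idea of weighting every teacher-student feature pair by a similarity-based attention score, instead of hand-picking links, can be sketched as follows. This is a loose illustration and not the paper's method: the paper uses a learned meta-network, whereas this sketch assumes raw dot-product similarity, features that already share one dimension, and a squared-distance matching term. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def attention_weighted_loss(teacher_feats, student_feats):
    """Sum over all teacher-student pairs of attention weight times distance.

    For each teacher feature, attention over the student features decides
    how strongly each pair contributes, so no link is selected by hand.
    """
    loss = 0.0
    for t in teacher_feats:
        weights = softmax([dot(t, s) for s in student_feats])
        loss += sum(
            w * squared_distance(t, s) for w, s in zip(weights, student_feats)
        )
    return loss


# Hypothetical pooled features from two teacher and two student layers.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [0.2, 0.8]]
print(attention_weighted_loss(teacher, student))
```

Because the weights for each teacher feature sum to one, pairs with low similarity contribute little, which mirrors the abstract's description of controlling distillation intensity across all possible pairs.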

Character Region Attention For Text Spotting

Jul 19, 2020

Youngmin Baek, Seung Shin, Jeonghun Baek, Sungrae Park, Junyeop Lee, Daehyun Nam, Hwalsuk Lee

Abstract: A scene text spotter is composed of text detection and recognition modules. Many studies have been conducted to unify these modules into an end-to-end trainable model to achieve better performance. A typical architecture places detection and recognition modules into separate branches, and RoI pooling is commonly used to let the branches share a visual feature. However, there still exists a chance of establishing a more complementary connection between the modules when adopting a recognizer that uses an attention-based decoder and a detector that represents spatial information of the character regions. This is possible since the two modules share a common sub-task, which is to find the location of the character regions. Based on this insight, we construct a tightly coupled single pipeline model. This architecture is formed by utilizing detection outputs in the recognizer and propagating the recognition loss through the detection stage. The use of a character score map helps the recognizer attend better to the character center points, and the recognition loss propagation to the detector module enhances the localization of the character regions. Also, a strengthened sharing stage allows feature rectification and boundary localization of arbitrary-shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available straight and curved benchmark datasets.

* 17 pages, 9 figures, Accepted by ECCV 2020
