Weigang Qi

Masked Vision-Language Transformers for Scene Text Recognition

Nov 09, 2022

Jie Wu, Ying Peng, Shengming Zhang, Weigang Qi, Jian Zhang

Abstract: Scene text recognition (STR) enables computers to recognize and read text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose novel Masked Vision-Language Transformers (MVLT) to capture both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune the model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.

* The paper was accepted by the 33rd British Machine Vision Conference (BMVC 2022)
