Abstract: Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to leverage acoustic and semantic features simultaneously in the decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant, Semi-ASCD, is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets with Transformer, Conformer, and Branchformer as encoders. The experimental results show that ASCD significantly improves recognition performance by leveraging acoustic and semantic information cooperatively.
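A minimal sketch of what a causal multimodal attention mask could look like, assuming the decoder attends over a concatenated [acoustic frames | text tokens] sequence where acoustic frames are fully visible and text positions are restricted causally; the layout and function name are illustrative assumptions, not the paper's exact construction.

```python
import torch

def causal_multimodal_mask(num_acoustic: int, num_text: int) -> torch.Tensor:
    """Boolean attention mask over [acoustic frames | text tokens].

    True  -> attention allowed
    False -> attention blocked

    Assumed layout (illustrative):
      * every position may attend to all acoustic frames,
      * text token i may attend only to text tokens <= i, so future
        semantic states never leak into earlier steps during training.
    """
    total = num_acoustic + num_text
    mask = torch.zeros(total, total, dtype=torch.bool)

    # All positions can see the acoustic block.
    mask[:, :num_acoustic] = True

    # Causal (lower-triangular) visibility inside the text block.
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    mask[num_acoustic:, num_acoustic:] = causal
    return mask

# Example: 4 acoustic frames, 3 text tokens.
print(causal_multimodal_mask(4, 3).int())
```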
Abstract: Scene text recognition is a popular topic and is extensively used in industry. Although many methods have achieved satisfactory performance on close-set text recognition challenges, they become infeasible in open-set scenarios, where collecting data or retraining models for novel characters is too expensive. For example, annotating samples for foreign languages can be costly, and retraining the model each time a "novel" character is discovered in historical documents also consumes time and resources. In this paper, we introduce and formulate a new task, the open-set text recognition task, which demands the capability to spot and recognize novel characters without retraining. We propose a label-to-prototype learning framework that fulfills the requirements of the proposed task. Specifically, novel characters are mapped to their corresponding prototypes with a Label-to-Prototype Learning module. The module is trained on labels of seen characters and generalizes to generate class centers for novel characters without retraining. The framework also supports rejection of out-of-set characters, which allows unknown characters to be spotted during evaluation. Extensive experiments show that our method achieves promising performance on a variety of zero-shot, close-set, and open-set text recognition datasets.
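A minimal sketch of prototype matching with open-set rejection, assuming character features are compared to label-generated prototypes by cosine similarity and rejected below a score threshold; the threshold rule, function name, and dimensions are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def classify_with_rejection(features: torch.Tensor,
                            prototypes: torch.Tensor,
                            threshold: float = 0.5):
    """Match per-character features against class prototypes.

    features   : (N, D) visual features of detected characters
    prototypes : (C, D) class centers, e.g. produced from label embeddings
    threshold  : similarity below this value -> label -1 ("unknown")

    The threshold-based rejection is an illustrative stand-in for the
    framework's rejection capability over out-of-set characters.
    """
    sims = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T
    scores, labels = sims.max(dim=-1)
    labels[scores < threshold] = -1  # reject out-of-set characters
    return labels, scores

# Toy usage: 8 characters, 256-d features, 100 known classes.
feats = torch.randn(8, 256)
protos = torch.randn(100, 256)
labels, scores = classify_with_rejection(feats, protos, threshold=0.3)
```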