Kang-Min Kim

Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking

Dec 15, 2022

Mingyu Lee, Jun-Hyung Park, Junho Kim, Kang-Min Kim, SangKeun Lee

Abstract: Masked language modeling (MLM) has been widely used for pre-training effective bidirectional representations, but it incurs substantial training costs. In this paper, we propose a novel concept-based curriculum masking (CCM) method to efficiently pre-train a language model. CCM differs from existing curriculum learning approaches in two key ways that reflect the nature of MLM. First, we introduce a carefully designed linguistic difficulty criterion that evaluates the MLM difficulty of each token. Second, we construct a curriculum that gradually masks words related to the previously masked words, retrieved from a knowledge graph. Experimental results show that CCM significantly improves pre-training efficiency. Specifically, the model trained with CCM achieves performance comparable to the original BERT on the General Language Understanding Evaluation (GLUE) benchmark at half of the training cost.

* EMNLP 2022
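
The curriculum idea in the abstract above (start by masking a small set of concepts, then gradually add words related to them through a knowledge graph) can be illustrated with a toy sketch. This is not the authors' implementation: the graph format (a plain dict of neighbours), the function names `expand_concepts`, `build_curriculum`, and `mask_tokens`, the one-hop expansion per stage, and the `[MASK]` placeholder are all assumptions made for illustration, and the paper's linguistic difficulty criterion is not modeled here.

```python
import random


def expand_concepts(current, graph):
    """Return the current concepts plus their one-hop neighbours in the graph."""
    expanded = set(current)
    for concept in current:
        expanded.update(graph.get(concept, ()))
    return expanded


def build_curriculum(seed_concepts, graph, num_stages):
    """Build a list of maskable concept sets, one per stage (hypothetical schedule).

    Stage 0 contains only the seed concepts; each later stage adds the
    neighbours of the previous stage's concepts.
    """
    stages = [set(seed_concepts)]
    for _ in range(num_stages - 1):
        stages.append(expand_concepts(stages[-1], graph))
    return stages


def mask_tokens(tokens, maskable, mask_prob=1.0, mask_token="[MASK]", rng=None):
    """Replace tokens that belong to `maskable` with `mask_token`.

    Each eligible token is masked with probability `mask_prob`.
    """
    rng = rng or random.Random(0)
    return [
        mask_token if tok in maskable and rng.random() < mask_prob else tok
        for tok in tokens
    ]


# Toy knowledge graph: concept -> related concepts (assumed structure).
toy_graph = {"dog": ["animal"], "animal": ["organism"]}
stages = build_curriculum({"dog"}, toy_graph, num_stages=3)
sentence = ["the", "dog", "is", "an", "animal"]
for i, maskable in enumerate(stages):
    print(i, mask_tokens(sentence, maskable))
```

In this sketch the set of maskable words only grows, so later stages mask a superset of what earlier stages masked. How the seed concepts are chosen and how fast the set grows are left open here.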

Incorporating Word Embeddings into Open Directory Project based Large-scale Classification

Apr 03, 2018

Kang-Min Kim, Aliyeva Dinara, Byung-Ju Choi, SangKeun Lee

Abstract: Recently, implicit representation models, such as embeddings or deep learning, have been successfully applied to text classification tasks owing to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in large-scale text classification, such as Open Directory Project (ODP)-based text classification. However, the performance of these models is limited by the associated knowledge bases. In this paper, we incorporate word embeddings into ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories, by jointly modeling word embeddings and the ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category vectors obtained from the joint model and the word vectors obtained from word embeddings. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaged F1-score and precision at k, respectively, over state-of-the-art techniques.

* 12 pages, 2 figures, in Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
