We benchmark knowledge distillation (KD) from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 Indonesian-language datasets grouped into two tasks: text classification and sequence labeling. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose several quick wins for performing KD to produce small NLP models through efficient KD training mechanisms involving simple choices of loss function, word embeddings, and unlabeled data preparation.
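To illustrate the kind of training objective such a KD setup typically relies on, the sketch below shows the standard soft-target distillation loss (Hinton et al., 2015) in PyTorch, combining a temperature-scaled KL term against the teacher with a hard-label cross-entropy term; the function name and the `temperature` and `alpha` hyperparameters are illustrative assumptions, not the exact formulation used in this work.

```python
# Minimal sketch of a standard distillation loss; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Combine a soft-target loss against the teacher with a hard-label loss.

    student_logits, teacher_logits: (batch, num_classes) tensors
    labels: (batch,) gold label indices
    temperature, alpha: hypothetical hyperparameters, not taken from the paper
    """
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on gold labels
    # (this term is dropped when distilling on unlabeled augmentation data).
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```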