Title: Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

Nov 25, 2024

Yao Fu, Yin Yu, Xiaotian Han, Runchao Li, Xianxuan Long, Haotian Yu, Pan Li

Abstract: Knowledge distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprints. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT-4. In this regard, self-distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers' guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous mini-batch (DynSDPB), which distills each current iteration from the logits generated in the previous one. Additionally, to address prediction inaccuracies during the early iterations, we dynamically adjust the distillation influence and temperature values to enhance the adaptability of fine-tuning. Furthermore, DynSDPB is a novel fine-tuning policy that facilitates the seamless integration of existing self-correction and self-training techniques for small language models (SLMs) because they all require updating SLMs' parameters. We demonstrate the superior performance of DynSDPB on both encoder-only LMs (e.g., BERT model families) and decoder-only LMs (e.g., LLaMA model families), validating its effectiveness across natural language understanding (NLU) and natural language generation (NLG) benchmarks.
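The idea of distilling from the previous iteration's logits, with a dynamically adjusted distillation weight and temperature, can be illustrated with a minimal Python sketch. The abstract does not give the actual adjustment rule, so everything specific here is an assumption made for illustration: the function names, the `alpha_base` and `temp_base` parameters, and the choice to scale both quantities by the normalized entropy of the previous prediction. It should not be read as the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy divided by log(num_classes), so the result lies in [0, 1].

    Requires at least two classes.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(cur_logits, prev_logits, label,
                           alpha_base=1.0, temp_base=2.0):
    """Cross-entropy plus a distillation term toward the previous logits.

    prev_logits: logits the same model produced for this example at the
    previous iteration, or None if the example has not been seen yet.
    ASSUMPTION: the uncertainty-based rule below is hypothetical; the
    source only states that the weight and temperature are adjusted
    dynamically.
    """
    ce = -math.log(softmax(cur_logits)[label])
    if prev_logits is None:
        return ce

    # Hypothetical rule: an uncertain previous prediction gets a smaller
    # distillation weight and a higher (softer) temperature.
    uncertainty = normalized_entropy(softmax(prev_logits))
    alpha = alpha_base * (1.0 - uncertainty)
    temperature = temp_base * (1.0 + uncertainty)

    teacher = softmax(prev_logits, temperature)
    student = softmax(cur_logits, temperature)
    # The temperature**2 factor is the usual KD gradient-scale correction.
    return ce + alpha * temperature ** 2 * kl_divergence(teacher, student)
```

In this sketch, a near-uniform previous prediction drives `alpha` toward zero, so the loss falls back to plain cross-entropy, which is one simple way to limit the effect of inaccurate early-iteration logits. No teacher model or architectural change is involved; only the cached logits from the previous step are needed.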

* Work in progress
