Title: Understanding Catastrophic Forgetting in Language Models via Implicit Inference

Sep 18, 2023

Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan

Abstract: Fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback) is a crucial step in training language models to robustly carry out tasks of interest. However, we lack a systematic understanding of the effects of fine-tuning, particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of suppressing model capabilities on other tasks. This degradation is especially pronounced for tasks "closest" to the fine-tuning distribution. We hypothesize that language models implicitly infer the task to which the prompt corresponds, and that the fine-tuning process predominantly skews this task inference towards tasks in the fine-tuning distribution. To test this hypothesis, we propose conjugate prompting to see whether we can recover pretrained capabilities. Conjugate prompting artificially makes the task look farther from the fine-tuning distribution while requiring the same capability. We find that conjugate prompting systematically recovers some of the pretraining capabilities in our synthetic setup. We then apply conjugate prompting to real-world LLMs using the observation that fine-tuning distributions are typically heavily skewed towards English. We find that simply translating the prompts to different languages can cause the fine-tuned models to respond like their pretrained counterparts instead. This allows us to recover the in-context learning abilities lost via instruction tuning and, more concerningly, to recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
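The abstract describes conjugate prompting only in words, so the following is a minimal sketch of the general shape of the idea and not the authors' implementation. It assumes the procedure can be read as three steps: transform the prompt so it looks farther from the fine-tuning distribution, query the model, then map the output back. All names here (`conjugate_prompt`, `to_conjugate`, `from_conjugate`, and the toy stand-ins at the bottom) are hypothetical and chosen purely for illustration.

```python
from typing import Callable

# Hypothetical type alias: anything that maps a text prompt to a text response.
TextFn = Callable[[str], str]


def conjugate_prompt(
    model: TextFn,
    to_conjugate: TextFn,
    from_conjugate: TextFn,
    prompt: str,
) -> str:
    """Query `model` on a transformed prompt and map the answer back.

    Assumption: `to_conjugate` preserves the underlying task (for example,
    translating an English prompt into another language) and
    `from_conjugate` approximately inverts it (translating back).
    """
    transformed = to_conjugate(prompt)   # move the prompt away from the fine-tuning distribution
    raw_response = model(transformed)    # the model answers in the transformed space
    return from_conjugate(raw_response)  # bring the answer back to the original space


# Toy stand-ins so the sketch runs without any external service.
# Reversing a string plays the role of "translation"; it is its own inverse.
def reverse_text(text: str) -> str:
    return text[::-1]


def toy_model(text: str) -> str:
    # Placeholder "model" that simply upper-cases its input.
    return text.upper()


if __name__ == "__main__":
    print(conjugate_prompt(toy_model, reverse_text, reverse_text, "hello"))
```

In the real-world setting the abstract mentions, `to_conjugate` and `from_conjugate` would be translation steps into and out of a non-English language, and `model` would be a call to the fine-tuned LLM; none of those components are specified on this page, so they are left as plain callables here.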
