Title: Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models

Oct 07, 2020

Amrit Nagarajan, Sanchari Sen, Jacob R. Stevens, Anand Raghunathan

[Figures 1-4 for Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models: images not reproduced here]

Abstract: Transformer models have garnered a lot of interest in recent years by delivering state-of-the-art performance in a range of Natural Language Processing (NLP) tasks. However, these models can have over a hundred billion parameters, presenting very high computational and memory requirements. We address this challenge through Approximate Computing, specifically targeting the use of Transformers in NLP tasks. Transformers are typically pre-trained and subsequently specialized for specific tasks through transfer learning. Based on the observation that pre-trained Transformers are often over-parameterized for several downstream NLP tasks, we propose a framework to create smaller, faster and in some cases more accurate models. The cornerstones of the framework are a Significance Analysis (SA) method that identifies components in a pre-trained Transformer that are less significant for a given task, and techniques to approximate the less significant components. Our approximations include pruning of blocks, attention heads and weight groups, quantization of less significant weights, and a low-complexity sign-matching-based attention mechanism. Our framework can be adapted to produce models that are faster, smaller and/or more accurate, depending on the user's constraints. We apply our framework to seven Transformer models, including optimized models like DistilBERT and Q8BERT, and three downstream tasks. We demonstrate that our framework produces models that are up to 4x faster and up to 14x smaller (with less than 0.5% relative accuracy degradation), or up to 5.5% more accurate with simultaneous improvements of up to 9.83x in model size or 2.94x in speed.
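The idea of pruning less significant components under a relative-accuracy budget can be illustrated with a toy sketch. This is not the paper's Significance Analysis method: the scores, the component names (`head_0`, `block_3`, ...), the `toy_evaluate` function and the greedy loop are all hypothetical stand-ins, chosen only to show how a 0.5% relative degradation limit might gate each pruning step.

```python
from typing import Callable, Dict, List, Set


def relative_degradation(baseline: float, current: float) -> float:
    """Accuracy drop expressed as a fraction of the baseline accuracy."""
    return (baseline - current) / baseline


def prune_by_significance(
    significance: Dict[str, float],
    evaluate: Callable[[Set[str]], float],
    max_rel_drop: float = 0.005,  # 0.5% relative degradation budget
) -> List[str]:
    """Greedily remove components, least significant first.

    `significance` maps a component name to a hypothetical score.
    `evaluate` returns task accuracy with the given components removed.
    A component stays removed only if the budget is still respected.
    """
    baseline = evaluate(set())
    removed: Set[str] = set()
    pruned: List[str] = []
    for name in sorted(significance, key=significance.get):
        trial = removed | {name}
        if relative_degradation(baseline, evaluate(trial)) < max_rel_drop:
            removed = trial
            pruned.append(name)
    return pruned


# Toy stand-in for a real evaluation loop: each removed component
# costs a fixed, made-up amount of accuracy.
COST = {"head_0": 0.000, "head_1": 0.002, "head_2": 0.050, "block_3": 0.001}


def toy_evaluate(removed: Set[str]) -> float:
    return 0.90 - sum(COST[name] for name in removed)


# Made-up significance scores (lower means less significant).
scores = {"head_0": 0.01, "head_1": 0.20, "head_2": 0.90, "block_3": 0.05}
print(prune_by_significance(scores, toy_evaluate))
```

In this toy setup, `head_2` is kept because removing it would exceed the budget, while the three cheaper components are pruned. A real pipeline would replace `toy_evaluate` with an evaluation on the downstream task's validation set.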
