Fine-tuning large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identifying the best model to transfer-learn from, and quantifying transferability prevents expensive re-training on all candidate model/task pairs. We show that statistical problems with covariance estimation drive the poor performance of H-score [Bao et al., 2019], a common baseline for newer metrics, and propose a shrinkage-based estimator. This results in up to an 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure by You et al. [2021]. Our shrinkage-based H-score is 3-55 times faster to compute than LogME. Additionally, we look into the less common setting of target (as opposed to source) task selection. For some recent metrics, e.g., LEEP [Nguyen et al., 2020], we identify previously overlooked problems in such settings with differing numbers of labels, class-imbalance ratios, etc., which have led to these metrics being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We also outline the difficulties of comparing feature-dependent metrics, both supervised (e.g., H-score) and unsupervised (e.g., Maximum Mean Discrepancy [Long et al., 2015]), across source models/layers with different feature embedding dimensions. We show that dimensionality reduction methods allow for meaningful comparison across models and improve the performance of some of these measures. We investigate the performance of 14 different supervised and unsupervised metrics and demonstrate that even unsupervised metrics can identify the leading models for domain adaptation. We support our findings with ~65,000 experiments (fine-tuning trials).
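
To make the shrinkage-based H-score concrete, the following is a minimal Python sketch. It assumes a Ledoit-Wolf shrinkage estimator for the feature covariance and the standard H-score definition tr(cov(f)^{-1} cov(E[f|y])); the function name `shrinkage_h_score` and the specific shrinkage choice are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_h_score(features, labels):
    """H-score with a shrinkage (Ledoit-Wolf) estimate of the feature covariance.

    features: (n_samples, d) array of source-model embeddings
    labels:   (n_samples,) array of integer target-task labels
    """
    features = features - features.mean(axis=0, keepdims=True)

    # Shrinkage covariance of the features: better conditioned than the
    # sample covariance when d is large relative to n_samples.
    cov_f = LedoitWolf().fit(features).covariance_

    # Class-conditional mean features g(x) = E[f(X) | Y = y(x)].
    g = np.zeros_like(features)
    for c in np.unique(labels):
        idx = labels == c
        g[idx] = features[idx].mean(axis=0)
    cov_g = np.cov(g, rowvar=False)

    # H-score: tr(cov(f)^{-1} cov(E[f|y])).
    return np.trace(np.linalg.pinv(cov_f) @ cov_g)
```

Higher scores indicate source features whose class-conditional structure is easier to separate, i.e., models expected to transfer better to the target task.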