Robbie Jones

Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation

Apr 01, 2022

Kendrick Shen, Robbie Jones, Ananya Kumar, Sang Michael Xie, Jeff Z. HaoChen, Tengyu Ma, Percy Liang

Abstract: We consider unsupervised domain adaptation (UDA), where labeled data from a source domain (e.g., photographs) and unlabeled data from a target domain (e.g., sketches) are used to learn a classifier for the target domain. Conventional UDA methods (e.g., domain adversarial training) learn domain-invariant features to improve generalization to the target domain. In this paper, we show that contrastive pre-training, which learns features on unlabeled source and target data and then fine-tunes on labeled source data, is competitive with strong UDA methods. However, we find that contrastive pre-training does not learn domain-invariant features, diverging from conventional UDA intuitions. We show theoretically that contrastive pre-training can learn features that vary substantially across domains but still generalize to the target domain, by disentangling domain and class information. Our results suggest that domain invariance is not necessary for UDA. We empirically validate our theory on benchmark vision datasets.

* 35 pages

Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Feb 21, 2022

Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang

Abstract: When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).

* ICLR (Oral) 2022
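
The two-step LP-FT strategy in the abstract above can be illustrated with a minimal sketch on a toy two-layer linear network, where the prediction is the head `v` applied to the features `B x`. This is not the authors' code. The toy data, the "pretrained" feature matrix `B_pre`, the zero-initialized head, the learning rates, the step counts, and the helper names (`forward`, `mse`, `train`, `distortion`) are all assumptions chosen for illustration. The sketch only measures how far the feature matrix moves from its pretrained value; it does not reproduce the paper's ID or OOD accuracy results.

```python
from itertools import product


def forward(B, v, x):
    """Two-layer linear network: features B x, then head v on the features."""
    feats = [sum(b * xj for b, xj in zip(row, x)) for row in B]
    pred = sum(vi * fi for vi, fi in zip(v, feats))
    return pred, feats


def mse(B, v, data):
    """Mean squared error of the network on (x, y) pairs."""
    return sum((forward(B, v, x)[0] - y) ** 2 for x, y in data) / len(data)


def train(B, v, data, steps, lr, update_features):
    """Full-batch gradient descent on squared loss.

    update_features=False trains only the head (linear probing);
    update_features=True trains the head and the features (full fine-tuning).
    """
    B = [row[:] for row in B]  # copy so the caller's weights are untouched
    v = v[:]
    n = len(data)
    for _ in range(steps):
        grad_v = [0.0] * len(v)
        grad_B = [[0.0] * len(row) for row in B]
        for x, y in data:
            pred, feats = forward(B, v, x)
            err = 2.0 * (pred - y) / n
            for i in range(len(v)):
                grad_v[i] += err * feats[i]
                if update_features:
                    for j in range(len(x)):
                        grad_B[i][j] += err * v[i] * x[j]
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
        if update_features:
            B = [[b - lr * g for b, g in zip(row, grow)]
                 for row, grow in zip(B, grad_B)]
    return B, v


def distortion(B_new, B_old):
    """Largest absolute change in any entry of the feature matrix."""
    return max(abs(a - b) for ra, rb in zip(B_new, B_old) for a, b in zip(ra, rb))


# Hypothetical toy task: y = 2*x0 - x1, inputs on a 3x3x3 grid.
data = [(list(x), 2.0 * x[0] - x[1]) for x in product([-1.0, 0.0, 1.0], repeat=3)]
B_pre = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # assumed "good" pretrained features
head0 = [0.0, 0.0]                          # assumed fixed initial head

# Baseline: full fine-tuning directly from the fixed head.
B_ft, v_ft = train(B_pre, head0, data, steps=200, lr=0.01, update_features=True)

# LP-FT: linear probing first, then full fine-tuning from the probed head.
B_lp, v_lp = train(B_pre, head0, data, steps=200, lr=0.1, update_features=False)
B_lpft, v_lpft = train(B_lp, v_lp, data, steps=200, lr=0.01, update_features=True)

print("feature change, fine-tuning:", distortion(B_ft, B_pre))
print("feature change, LP-FT:      ", distortion(B_lpft, B_pre))
print("LP-FT training loss:        ", mse(B_lpft, v_lpft, data))
```

In this toy setup the pretrained features already suffice for the task, so linear probing fits the head almost exactly and the subsequent fine-tuning step has almost no gradient left to move the features. Fine-tuning from the fixed head, by contrast, updates the features while the head is still being learned, which is the distortion mechanism the abstract describes.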

In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness

Dec 16, 2020

Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, Percy Liang

Abstract: Consider a prediction setting where a few inputs (e.g., satellite images) are expensively annotated with the prediction targets (e.g., crop types), and many inputs are cheaply annotated with auxiliary information (e.g., climate information). How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt out-of-distribution (OOD) error; while (ii) using auxiliary information as outputs of auxiliary tasks to pre-train a model improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.
