Contrastive learning is a paradigm for learning representations from unlabelled data that has been highly successful for image and text data. Several recent works have examined contrastive losses to argue that contrastive models effectively learn spectral embeddings, while a few works relate (wide) contrastive models to kernel principal component analysis (PCA). However, it is not known whether trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of two-layer contrastive models with non-linear activation, and characterize when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first convergence results for the NTK under contrastive losses, and present a nuanced picture: the NTK of wide networks remains almost constant for contrastive losses based on cosine similarity, but not for losses based on dot-product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on the output layer, a constraint implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that the representations learned by contrastive models are close to the principal components of a certain matrix computed from random features. We empirically show that our theoretical results may hold beyond two-layer networks.
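
The abstract contrasts two similarity choices, dot product and cosine, whose losses lead to different NTK behaviour in wide networks. Below is a minimal sketch, not the paper's code, of a two-layer model with a simple pairwise contrastive loss under each similarity; the names `two_layer`, `contrastive_loss`, and the specific loss form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, k = 16, 512, 4          # input dim, hidden width, embedding dim

W1 = rng.normal(size=(width, d)) / np.sqrt(d)      # first layer weights
W2 = rng.normal(size=(k, width)) / np.sqrt(width)  # output layer weights

def two_layer(x):
    """Two-layer model with non-linear (ReLU) activation: f(x) = W2 relu(W1 x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def dot_similarity(z1, z2):
    return z1 @ z2

def cosine_similarity(z1, z2):
    return (z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2) + 1e-12)

def contrastive_loss(x, x_pos, x_neg, sim):
    """Pull a positive pair together and push a negative pair apart
    under the chosen similarity function `sim`."""
    z, z_pos, z_neg = two_layer(x), two_layer(x_pos), two_layer(x_neg)
    return -sim(z, z_pos) + sim(z, z_neg)

x = rng.normal(size=d)
x_pos = x + 0.1 * rng.normal(size=d)   # augmented (positive) view of x
x_neg = rng.normal(size=d)             # independent (negative) sample

print("dot-product loss:", contrastive_loss(x, x_pos, x_neg, dot_similarity))
print("cosine loss:     ", contrastive_loss(x, x_pos, x_neg, cosine_similarity))
```

The only difference between the two variants is the normalization inside the similarity; the paper's claim is that this choice determines whether the NTK of the wide model stays (almost) constant during training.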