The objective of this paper is self-supervised learning of feature embeddings from videos, suitable for correspondence flow, i.e. matching correspondences across frames in a video. We leverage the natural spatio-temporal coherence of appearance in videos to create a "pointer" model that learns to reconstruct a target frame by copying pixels from a reference frame. We make three contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for correspondence matching, preventing it from learning trivial solutions, e.g. matching based on low-level colour information. Second, we propose to train the model over a long temporal window in videos, making it more robust to complex object deformation and occlusion, which usually lead to the well-known problem of tracker drift. To this end, we formulate a recursive model, trained with scheduled sampling and cycle consistency. Third, we achieve state-of-the-art performance on the DAVIS video segmentation and JHMDB keypoint tracking tasks, outperforming previous self-supervised learning approaches by a significant margin. Moreover, to shed light on the potential of self-supervised learning for correspondence flow, we probe the upper bound by training on more diverse video data, demonstrating a further significant improvement. The source code will be released upon acceptance.
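To make the "pointer" mechanism concrete, the sketch below shows one way such frame reconstruction by copying can be implemented: features of the target frame attend over features of the reference frame, and the resulting affinity weights softly copy reference pixels. This is a minimal illustration under our own assumptions (the function name, tensor shapes, and temperature parameter are hypothetical), not the released implementation; the information bottleneck described above could, for instance, be realised by degrading the colour information of the encoder's input.

```python
# Minimal sketch of reconstructing a target frame by softly copying pixels
# from a reference frame, weighted by feature-space affinity. Names, shapes,
# and the `temperature` parameter are illustrative assumptions.
import torch
import torch.nn.functional as F

def copy_reconstruct(ref_frame, ref_feat, tgt_feat, temperature=1.0):
    """Reconstruct the target frame from the reference frame.

    ref_frame: (B, 3, H, W) reference frame pixels (e.g. colour channels)
    ref_feat:  (B, C, H, W) features of the reference frame
    tgt_feat:  (B, C, H, W) features of the target frame
    """
    B, C, H, W = ref_feat.shape
    # Flatten spatial dimensions: each frame becomes H*W feature vectors.
    ref = ref_feat.flatten(2)   # (B, C, HW)
    tgt = tgt_feat.flatten(2)   # (B, C, HW)
    # Affinity between every target location i and every reference location j.
    affinity = torch.einsum('bci,bcj->bij', tgt, ref) / temperature  # (B, HW, HW)
    # Softmax over reference locations turns affinities into copy weights.
    weights = F.softmax(affinity, dim=-1)
    # Target pixels are an affinity-weighted sum of reference pixels.
    pixels = ref_frame.flatten(2)                       # (B, 3, HW)
    recon = torch.einsum('bij,bcj->bci', weights, pixels)
    return recon.reshape(B, 3, H, W)
```

During training, the reconstruction would be compared against the true target frame with a reconstruction loss; at test time, the same affinity weights can propagate any per-pixel labels (segmentation masks, keypoints) from the reference frame instead of colours.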