This paper proposes a pre-trained neural network for handling event camera data. Our model is trained in a self-supervised learning framework, and uses paired event camera data and natural RGB images for training. Our method contains three modules connected in a sequence: i) a family of event data augmentations, generating meaningful event images for self-supervised training; ii) a conditional masking strategy to sample informative event patches from event images, encouraging our model to capture the spatial layout of a scene and fast training; iii) a contrastive learning approach, enforcing the similarity of embeddings between matching event images, and between paired event-RGB images. An embedding projection loss is proposed to avoid the model collapse when enforcing event embedding similarities. A probability distribution alignment loss is proposed to encourage the event data to be consistent with its paired RGB image in feature space. Transfer performance in downstream tasks shows superior performance of our method over state-of-the-art methods. For example, we achieve top-1 accuracy at 64.83\% on the N-ImageNet dataset.