Abstract:Self-supervised learning (SSL) has become an important approach in pretraining large neural networks, enabling unprecedented scaling of model and dataset sizes. While recent advances like I-JEPA have shown promising results for Vision Transformers, adapting such methods to Convolutional Neural Networks (CNNs) presents unique challenges. In this paper, we introduce CNN-JEPA, a novel SSL method that successfully applies the joint embedding predictive architecture approach to CNNs. Our method incorporates a sparse CNN encoder to handle masked inputs, a fully convolutional predictor using depthwise separable convolutions, and an improved masking strategy. We demonstrate that CNN-JEPA outperforms I-JEPA with ViT architectures on ImageNet-100, achieving 73.3% linear top-1 accuracy with a standard ResNet-50 encoder. Compared to other CNN-based SSL methods, CNN-JEPA requires 17-35% less training time for the same number of epochs and approaches the linear and k-NN top-1 accuracies of BYOL, SimCLR, and VICReg. Our approach offers a simpler, more efficient alternative to existing SSL methods for CNNs, requiring minimal augmentations and no separate projector network.
Abstract:Self-supervised learning (SSL) has been shown to be a powerful approach for learning visual representations. In this study, we propose incorporating ZCA whitening as the final layer of the encoder in self-supervised learning to enhance the quality of learned features by normalizing and decorrelating them. Although whitening has been utilized in SSL in previous works, its potential to universally improve any SSL model has not been explored. We demonstrate that adding whitening as the last layer of SSL pretrained encoders is independent of the self-supervised learning method and encoder architecture, thus it improves performance for a wide range of SSL methods across multiple encoder architectures and datasets. Our experiments show that whitening is capable of improving linear and k-NN probing accuracy by 1-5%. Additionally, we propose metrics that allow for a comprehensive analysis of the learned features, provide insights into the quality of the representations and help identify collapse patterns.
Abstract:In this work, we study vision-based end-to-end reinforcement learning on vehicle control problems, such as lane following and collision avoidance. Our controller policy is able to control a small-scale robot to follow the right-hand lane of a real two-lane road, while its training was solely carried out in a simulation. Our model, realized by a simple, convolutional network, only relies on images of a forward-facing monocular camera and generates continuous actions that directly control the vehicle. To train this policy we used Proximal Policy Optimization, and to achieve the generalization capability required for real performance we used domain randomization. We carried out thorough analysis of the trained policy, by measuring multiple performance metrics and comparing these to baselines that rely on other methods. To assess the quality of the simulation-to-reality transfer learning process and the performance of the controller in the real world, we measured simple metrics on a real track and compared these with results from a matching simulation. Further analysis was carried out by visualizing salient object maps.