Abstract: In this thesis, we present pioneering work on sparse keypoint tracking across images using transformer networks. Although deep learning-based keypoint matching has been widely investigated using graph neural networks and, more recently, transformer networks, existing methods remain too slow to operate in real time and are particularly sensitive to the poor repeatability of keypoint detectors. To address these shortcomings, we study the particular case of real-time, robust keypoint tracking. Specifically, we propose a novel architecture that provides fast and robust estimation of keypoint tracks between successive images of a video sequence. Our method takes advantage of a recent breakthrough in computer vision, namely visual transformer networks. It consists of two successive stages: a coarse matching stage, followed by a fine localization of the predicted keypoint correspondences. Through various experiments, we show that our approach achieves competitive results and is highly robust to adverse conditions such as illumination changes, occlusion, and viewpoint differences.