Abstract: Heatmap regression based face alignment algorithms have achieved prominent performance on static images. However, when these methods are applied to videos or sequential images, both stability and accuracy degrade markedly. The reason is that temporal information is not considered, which is mainly reflected in the network structure and the loss function. This paper presents a novel backbone-replaceable fine-tuning framework that can swiftly convert a facial landmark detector designed for single images into a better-performing one suitable for videos. On this basis, we propose the Jitter loss, a novel temporal-information-based loss function devised to impose strong penalties on predicted landmarks that jitter around the ground truth. Our framework achieves at least 40% improvement on stability evaluation metrics while enhancing accuracy, without re-training the entire model, compared with state-of-the-art methods.
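As a rough illustration of the idea only (the paper's exact formulation of the Jitter loss is defined in the body of the paper, not here), a temporal jitter penalty can be sketched as comparing frame-to-frame predicted landmark motion against the corresponding ground-truth motion, so that oscillation around a smoothly moving target is penalized. The function name `jitter_loss`, the tensor layout, and the `weight` parameter below are illustrative assumptions, not the authors' definitions.

```python
import torch

def jitter_loss(pred: torch.Tensor, gt: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact formulation).

    Penalizes predicted landmark motion between consecutive frames that is
    not explained by the ground-truth motion, i.e. jitter around the target.

    pred, gt: tensors of shape (T, N, 2) -- T frames, N landmarks, (x, y) coords.
    """
    # Frame-to-frame displacement of predictions and of the ground truth.
    pred_motion = pred[1:] - pred[:-1]   # shape (T-1, N, 2)
    gt_motion = gt[1:] - gt[:-1]         # shape (T-1, N, 2)
    # Residual motion = jitter; average its per-landmark Euclidean norm.
    jitter = pred_motion - gt_motion
    return weight * jitter.norm(dim=-1).mean()

# Hypothetical usage: combine with a per-frame accuracy term during fine-tuning.
pred = torch.rand(10, 68, 2)   # 10 frames, 68 landmarks
gt = torch.rand(10, 68, 2)
total_loss = jitter_loss(pred, gt, weight=0.5)
```

In such a sketch, the temporal term would typically be added to an existing per-frame heatmap regression loss during fine-tuning, leaving the pretrained backbone unchanged.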