Lip movement information is critical for many audio-visual tasks. However, extracting lip movement information from videos is challenging, as it is easily perturbed by factors such as personal identity and head pose. This paper proposes using a parametric 3D face model to disentangle lip movement information explicitly. Building on recent advances in 3D face reconstruction, we first present a method that consistently disentangles expression information, in which lip movement information resides. We then demonstrate that once the influence of perturbing factors is reduced by synthesizing faces from the disentangled lip movement information, the lip-sync task can be performed better with far less data. Finally, we show the effectiveness of our approach in the wild by testing it on an unseen dataset for active speaker detection, where it achieves competitive performance.
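As background, parametric 3D face models commonly represent a reconstructed face shape as a linear combination of identity and expression bases; the abstract does not specify the exact model used, so the formulation below is only an illustrative sketch of this standard parameterization:

\[
S \;=\; \bar{S} \;+\; B_{\mathrm{id}}\,\boldsymbol{\alpha} \;+\; B_{\mathrm{exp}}\,\boldsymbol{\beta},
\]

where \(\bar{S}\) is the mean face shape, \(B_{\mathrm{id}}\) and \(B_{\mathrm{exp}}\) are the identity and expression bases, \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) are the corresponding coefficient vectors, and head pose is modeled separately by a rigid rotation and translation. Under such a parameterization, lip movements are carried by the expression coefficients \(\boldsymbol{\beta}\), which is why isolating them from identity and pose serves as the disentanglement described above.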