Facial age estimation is an important and challenging problem in computer vision. Existing approaches usually employ deep neural networks (DNNs) to fit the mapping from facial features to age directly, even though some samples are noisy and confusing. We argue that it is more desirable to distinguish noisy and confusing facial images from regular ones and to suppress the interference arising from them. To this end, we propose self-paced deep regression forests (SP-DRFs), a framework that trains DNNs gradually for age estimation. Because the model is learned gradually, from easy samples to hard ones, it tends to be significantly more robust: it places more emphasis on reliable samples and avoids bad local minima. We demonstrate the efficacy of SP-DRFs on the Morph II and FG-NET datasets, where our method achieves state-of-the-art performance.