Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.