Distilling analytical models from data has the potential to advance our understanding and prediction of nonlinear dynamics. Although the discovery of governing equations from observed system states (e.g., trajectory time series) has shown success across a wide range of nonlinear dynamics, uncovering closed-form equations directly from raw videos remains an open challenge. To this end, we introduce a novel end-to-end unsupervised deep learning framework to uncover the mathematical structure of the equations that govern the dynamics of moving objects in videos. The architecture consists of (1) an encoder-decoder network that learns low-dimensional spatial/pixel coordinates of the moving object, (2) a learnable Spatial-Physical Transformation component that learns the mapping between the extracted spatial/pixel coordinates and the latent physical states of the dynamics, and (3) a numerical integrator-based sparse regression module that uncovers parsimonious closed-form governing equations of the learned physical states while simultaneously serving as a constraint on the autoencoder. The efficacy of the proposed method is demonstrated by uncovering the governing equations of a variety of nonlinear dynamical systems depicted by moving objects in videos. The resulting computational framework enables the discovery of parsimonious, interpretable models in a flexible and accessible sensing environment where only videos are available.
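To give a concrete sense of component (3), the snippet below is a minimal NumPy sketch of sequential thresholded least squares (STLSQ), the sparse regression scheme commonly used for equation discovery from state trajectories. It is an illustrative stand-in, not the paper's implementation: the simulated damped oscillator, the candidate library, and the threshold value are all assumptions chosen for the example, and the latent-state learning and integrator-based training loss of the full framework are omitted.

```python
import numpy as np

# Simulate a damped oscillator: dx/dt = -0.1x + 2y, dy/dt = -2x - 0.1y
dt = 0.001
t = np.arange(0, 10, dt)
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
f = lambda s: s @ A.T
X = np.empty((len(t), 2))
X[0] = [2.0, 0.0]
for k in range(len(t) - 1):
    # 4th-order Runge-Kutta step for an accurate trajectory
    k1 = f(X[k]); k2 = f(X[k] + 0.5 * dt * k1)
    k3 = f(X[k] + 0.5 * dt * k2); k4 = f(X[k] + dt * k3)
    X[k + 1] = X[k] + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Candidate function library Theta(X): [1, x, y, x^2, xy, y^2]
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
dXdt = np.gradient(X, dt, axis=0)  # finite-difference time derivatives

# Sequential thresholded least squares: alternate pruning and refitting
Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(Xi) < 0.05          # prune negligible coefficients
    Xi[small] = 0.0
    for j in range(dXdt.shape[1]):     # refit on the surviving terms only
        big = ~small[:, j]
        if big.any():
            Xi[big, j] = np.linalg.lstsq(Theta[:, big], dXdt[:, j],
                                         rcond=None)[0]

# Recovered sparse coefficient matrix: rows correspond to dx/dt and dy/dt,
# columns to the library terms [1, x, y, x^2, xy, y^2]
print(np.round(Xi.T, 2))
```

With clean, densely sampled states the pruning step drives the spurious library terms to zero and the refit recovers the true coefficients, illustrating why a parsimonious closed-form model can be read directly off the nonzero entries of `Xi`. In the full framework this regression operates on the latent physical states produced by components (1) and (2) rather than on simulated ground truth.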