This paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss). The minimization is characterized by a distribution-dependent ordinary differential equation (ODE), whose dynamics involve the Kantorovich potential between a current estimated distribution and the true data distribution. A main result shows that the time-marginal law of the ODE converges exponentially to the true data distribution. To prove that the ODE has a unique solution, we first explicitly construct a solution to the associated nonlinear Fokker-Planck equation and show that it coincides with the unique gradient flow for the $W_2$ loss. Based on this, a unique solution to the ODE is built from Trevisan's superposition principle and the exponential convergence results. An Euler scheme is proposed for the distribution-dependent ODE, and it is shown to recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which is natural in our gradient-flow framework. In both low- and high-dimensional experiments, with an appropriately increased level of persistent training, our algorithm converges faster than and outperforms Wasserstein generative adversarial networks (WGANs).
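As a rough sketch of the dynamics described above (the notation $\mu_d$ for the data distribution, $\varphi_{\mu_t}$ for the Kantorovich potential from the current marginal to $\mu_d$, and the step size $\epsilon$ are illustrative here; the precise normalization and step-size conditions are those of the paper), the distribution-dependent ODE and its Euler discretization may be written as
\[
  \frac{dY_t}{dt} = -\nabla \varphi_{\mu_t}(Y_t), \qquad \mu_t = \mathrm{Law}(Y_t),
  \qquad\text{and}\qquad
  Y_{k+1} = Y_k - \epsilon\, \nabla \varphi_{\mu_k}(Y_k), \qquad \mu_k = \mathrm{Law}(Y_k),
\]
so that each Euler step transports samples along the (estimated) Kantorovich potential toward $\mu_d$.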