4D seismic imaging has been widely used in CO$_2$ sequestration projects to monitor the fluid flow in the volumetric subsurface region that is not sampled by wells. Ideally, real-time monitoring and near-future forecasting would provide site operators with great insights to understand the dynamics of the subsurface reservoir and assess any potential risks. However, due to obstacles such as high deployment cost, availability of acquisition equipment, exclusion zones around surface structures, only very sparse seismic imaging data can be obtained during monitoring. That leads to an unavoidable and growing knowledge gap over time. The operator needs to understand the fluid flow throughout the project lifetime and the seismic data are only available at a limited number of times, this is insufficient for understanding the reservoir behavior. To overcome those challenges, we have developed spatio-temporal neural-network-based models that can produce high-fidelity interpolated or extrapolated images effectively and efficiently. Specifically, our models are built on an autoencoder, and incorporate the long short-term memory (LSTM) structure with a new loss function regularized by optical flow. We validate the performance of our models using real 4D post-stack seismic imaging data acquired at the Sleipner CO$_2$ sequestration field. We employ two different strategies in evaluating our models. Numerically, we compare our models with different baseline approaches using classic pixel-based metrics. We also conduct a blind survey and collect a total of 20 responses from domain experts to evaluate the quality of data generated by our models. Via both numerical and expert evaluation, we conclude that our models can produce high-quality 2D/3D seismic imaging data at a reasonable cost, offering the possibility of real-time monitoring or even near-future forecasting of the CO$_2$ storage reservoir.