Information about the spatio-temporal pattern of the electrical energy carried by EVs, rather than of the EVs themselves, is crucial for EVs to interact more effectively and intelligently with the smart grid. In this paper, we propose a framework for predicting the amount of electrical energy stored by a large number of EVs aggregated within different city-scale regions, based on the spatio-temporal pattern of that energy. The spatial pattern is modeled by a neural-network-based spatial predictor, while the temporal pattern is captured by a temporal predictor based on a linear-chain conditional random field (CRF). The two predictors are fed with spatial and temporal features, respectively, which are extracted from real trajectory data recorded in Beijing. Furthermore, we combine the two predictors into a spatio-temporal predictor by using an optimal combination coefficient that minimizes the normalized mean square error (NMSE) of the predictions. The prediction performance is evaluated through extensive experiments covering the spatial predictions, the temporal predictions, and the improvement achieved by the combined spatio-temporal predictor. The experimental results show that the NMSE of the spatio-temporal predictor stays below 0.1 for all investigated regions of Beijing. We further visualize the predictions and discuss the potential benefits that the proposed framework can bring to smart grid scheduling and EV charging.
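
The combination step described above can be illustrated with a minimal sketch. It assumes, since the abstract does not give the exact form, that the combined prediction is a convex-style blend `alpha * spatial + (1 - alpha) * temporal` and that NMSE is the summed squared error divided by the summed squared true values. Under those assumptions the NMSE-minimizing coefficient has a closed-form least-squares solution. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def nmse(actual, predicted):
    """NMSE under the assumed definition: squared error over signal energy."""
    num = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    den = sum(a ** 2 for a in actual)
    return num / den


def optimal_coefficient(actual, spatial, temporal):
    """Coefficient alpha minimizing the NMSE of alpha*spatial + (1-alpha)*temporal.

    The NMSE denominator does not depend on alpha, so this is ordinary
    least squares on (actual - temporal) against (spatial - temporal).
    """
    num = sum((a - t) * (s - t) for a, s, t in zip(actual, spatial, temporal))
    den = sum((s - t) ** 2 for s, t in zip(spatial, temporal))
    if den == 0:
        # Both predictors agree everywhere, so any alpha gives the same result.
        return 0.5
    return num / den


def combine(spatial, temporal, alpha):
    """Blend the two predictions with the given coefficient."""
    return [alpha * s + (1 - alpha) * t for s, t in zip(spatial, temporal)]


# Toy stored-energy values for one region (illustrative only, not real data).
actual = [10.0, 12.0, 9.0, 14.0]
spatial = [11.0, 11.0, 10.0, 13.0]
temporal = [9.0, 13.0, 8.0, 15.0]

alpha = optimal_coefficient(actual, spatial, temporal)
combined = combine(spatial, temporal, alpha)
print(alpha, nmse(actual, spatial), nmse(actual, temporal), nmse(actual, combined))
```

Because `alpha = 0` and `alpha = 1` recover the individual predictors, the blended NMSE on the fitting data can never exceed that of either predictor alone. This sketch leaves `alpha` unconstrained; whether the coefficient should be restricted to the interval [0, 1], and on which data it should be fitted, are design choices the abstract does not specify.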