Temporal and spatial features are both important for predicting the demands in the bike-sharing systems. Many relevant experiments in the literature support this. Meanwhile, it is observed that the data structure of spatial features with vector form is weaker in space than the videos, which have natural spatial structure. Therefore, to obtain more spatial features, this study introduces city map to generate GPS demand videos while employing a novel algorithm : eidetic 3D convolutional long short-term memory network named E3D-LSTM to process the video-level data in bike-sharing system. The spatio-temporal correlations and feature importance are experimented and visualized to validate the significance of spatial and temporal features. Despite the deep learning model is powerful in non-linear fitting ability, statistic model has better interpretation. This study adopts ensemble learning, which is a popular policy, to improve the performance and decrease variance. In this paper, we propose a novel model stacked by deep learning and statistical models, named the fusion multi-channel eidetic 3D convolutional long short-term memory network(FM-E3DCL-Net), to better process temporal and spatial features on the dataset about 100,000 transactions within one month in Shanghai of Mobike company. Furthermore, other factors like weather, holiday and time intervals are proved useful in addition to historical demand, since they decrease the root mean squared error (RMSE) by 29.4%. On this basis, the ensemble learning further decreases RMSE by 6.6%.