Classifying the land-use categories of multi-temporal scenes and detecting their scene-level semantic changes in imagery covering urban regions can directly reflect land-use transitions. Existing scene change detection methods rarely exploit the temporal correlation between bi-temporal features, and they are mainly evaluated on small-scale scene change detection datasets. In this work, we propose a CorrFusion module that fuses the highly correlated components of bi-temporal feature embeddings. We first extract deep representations of the bi-temporal inputs with deep convolutional networks. The extracted features are then projected into a lower-dimensional space to compute the instance-level correlation. The CorrFusion module performs cross-temporal fusion based on the computed correlation. The final scene classification results are obtained with softmax activation layers. In the objective function, we introduce a new formulation for calculating the temporal correlation. The detailed derivation of the backpropagation gradients for the proposed module is also given in this paper. In addition, we present a much larger-scale scene change detection dataset and conduct experiments on it. The experimental results demonstrate that the proposed CorrFusion module remarkably improves both multi-temporal scene classification and scene change detection results.
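
To make the data flow described above concrete, the following is a minimal NumPy sketch of the project, correlate, fuse, and classify steps. It is an illustration under stated assumptions, not the paper's formulation: the Pearson-style per-instance correlation, the clipped additive gate used for fusion, the random stand-in features and weights, and all function names (`instance_correlation`, `corr_fusion_sketch`, `softmax`) are hypothetical choices made only for this example.

```python
import numpy as np


def instance_correlation(z1, z2, eps=1e-8):
    """One correlation value per instance (row) of the paired embeddings.

    Assumption: Pearson correlation over the embedding dimension; the
    paper's own correlation formulation may differ.
    """
    z1c = z1 - z1.mean(axis=1, keepdims=True)
    z2c = z2 - z2.mean(axis=1, keepdims=True)
    num = (z1c * z2c).sum(axis=1)
    den = np.linalg.norm(z1c, axis=1) * np.linalg.norm(z2c, axis=1) + eps
    return num / den


def corr_fusion_sketch(f1, f2, w1, w2):
    """Project both temporal features, correlate them, then fuse.

    Assumption: fusion adds the other date's embedding, scaled by the
    non-negative part of the correlation (a hypothetical gate).
    """
    z1 = f1 @ w1  # projection into a lower-dimensional space
    z2 = f2 @ w2
    rho = instance_correlation(z1, z2)
    gate = np.clip(rho, 0.0, 1.0)[:, None]
    fused1 = z1 + gate * z2
    fused2 = z2 + gate * z1
    return fused1, fused2, rho


def softmax(x):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


# Random stand-ins for CNN features of 4 scene pairs (16-D), projected to 8-D.
rng = np.random.default_rng(0)
f1 = rng.normal(size=(4, 16))
f2 = rng.normal(size=(4, 16))
w1 = rng.normal(size=(16, 8))
w2 = rng.normal(size=(16, 8))
c1 = rng.normal(size=(8, 3))  # hypothetical classifier weights, 3 classes
c2 = rng.normal(size=(8, 3))

fused1, fused2, rho = corr_fusion_sketch(f1, f2, w1, w2)
p1 = softmax(fused1 @ c1)  # class probabilities at the first date
p2 = softmax(fused2 @ c2)  # class probabilities at the second date

# A scene-level change is flagged when the predicted categories differ.
changed = p1.argmax(axis=1) != p2.argmax(axis=1)
```

In this sketch a scene pair is marked as changed whenever its two predicted land-use categories disagree, which mirrors the way scene change detection follows from bi-temporal scene classification.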