In this paper, we proposed a framework for constructing two types of automatic image aesthetics assessment models with different CNN architectures and for improving the prediction of an image's aesthetic score (AS) by ensembling them. In addition, the regions of each image that the models attend to were extracted to analyze their consistency with the subjects of the images. The experimental results verify that the proposed method is effective in improving AS prediction. Furthermore, the AS classification models trained on the XiheAA dataset appear to learn latent photography principles, although it cannot be said that they learn an aesthetic sense.