The rapid development of remote sensing techniques provides rich, large-coverage, and high-temporal information of the ground, which can be coupled with the emerging deep learning approaches that enable latent features and hidden geographical patterns to be extracted. This study marks the first attempt to cross-compare performances of popular state-of-the-art deep learning models in estimating population distribution from remote sensing images, investigate the contribution of neighboring effect, and explore the potential systematic population estimation biases. We conduct an end-to-end training of four popular deep learning architectures, i.e., VGG, ResNet, Xception, and DenseNet, by establishing a mapping between Sentinel-2 image patches and their corresponding population count from the LandScan population grid. The results reveal that DenseNet outperforms the other three models, while VGG has the worst performances in all evaluating metrics under all selected neighboring scenarios. As for the neighboring effect, contradicting existing studies, our results suggest that the increase of neighboring sizes leads to reduced population estimation performance, which is found universal for all four selected models in all evaluating metrics. In addition, there exists a notable, universal bias that all selected deep learning models tend to overestimate sparsely populated image patches and underestimate densely populated image patches, regardless of neighboring sizes. The methodological, experimental, and contextual knowledge this study provides is expected to benefit a wide range of future studies that estimate population distribution via remote sensing imagery.