Convolutional Neural Networks (CNNs) have been consistently proved state-of-the-art results in image Super-Resolution (SR), representing an exceptional opportunity for the remote sensing field to extract further information and knowledge from captured data. However, most of the works published in the literature have been focusing on the Single-Image Super-Resolution problem so far. At present, satellite based remote sensing platforms offer huge data availability with high temporal resolution and low spatial resolution. In this context, the presented research proposes a novel residual attention model (RAMS) that efficiently tackles the multi-image super-resolution task, simultaneously exploiting spatial and temporal correlations to combine multiple images. We introduce the mechanism of visual feature attention with 3D convolutions in order to obtain an aware data fusion and information extraction of the multiple low-resolution images, transcending limitations of the local region of convolutional operations. Moreover, having multiple inputs with the same scene, our representation learning network makes extensive use of nestled residual connections to let flow redundant low-frequency signals and focus the computation on more important high-frequency components. Extensive experimentation and evaluations against other available solutions, either for single or multi-image super-resolution, have demonstrated that the proposed deep learning-based solution can be considered state-of-the-art for Multi-Image Super-Resolution for remote sensing applications.