Image fusion in Remote Sensing (RS) has been a consistent demand due to its ability to turn raw images of different resolutions, sources, and modalities into accurate, complete, and spatio-temporally coherent images. It greatly facilitates downstream applications such as pan-sharpening, change detection, land-cover classification, etc. Yet, image fusion solutions are highly disparate to various remote sensing problems and thus are often narrowly defined in existing reviews as topical applications, such as pan-sharpening, and spatial-temporal image fusion. Considering that image fusion can be theoretically applied to any gridded data through pixel-level operations, in this paper, we expanded its scope by comprehensively surveying relevant works with a simple taxonomy: 1) many-to-one image fusion; 2) many-to-many image fusion. This simple taxonomy defines image fusion as a mapping problem that turns either a single or a set of images into another single or set of images, depending on the desired coherence, e.g., spectral, spatial/resolution coherence, etc. We show that this simple taxonomy, despite the significant modality difference it covers, can be presented by a conceptually easy framework. In addition, we provide a meta-analysis to review the major papers studying the various types of image fusion and their applications over the years (from the 1980s to date), covering 5,926 peer-reviewed papers. Finally, we discuss the main benefits and emerging challenges to provide open research directions and potential future works.