Abstract: Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground-view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view-consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground-view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground-view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset show superior results compared to other leading synthesis approaches. See the project page for more results: https://chaoren2357.github.io/website-skyeyes/.
Abstract: A critical step in digital surface model (DSM) generation is feature matching. Off-track (or multi-date) satellite stereo images, in particular, can challenge the performance of feature matching due to spectral distortions between images, long baselines, and wide intersection angles. Feature matching methods have evolved over the years from handcrafted methods (e.g., SIFT) to learning-based methods (e.g., SuperPoint and SuperGlue). In this paper, we compare the performance of different feature extraction and matching methods applied to satellite imagery. A wide range of stereo pairs (~500) covering two separate study sites is used. SIFT, as a widely used classic feature extraction and matching algorithm, is compared with seven deep learning matching methods: SuperGlue, LightGlue, LoFTR, ASpanFormer, DKM, GIM-LightGlue, and GIM-DKM. Results demonstrate that traditional matching methods are still competitive in this age of deep learning, although learning-based methods are very promising for particular scenarios.
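For concreteness, the classic baseline such comparisons start from can be sketched as follows. This is a minimal SIFT-plus-ratio-test matcher using OpenCV, not the paper's evaluation harness; the function name and ratio threshold are our illustrative choices.

    # Minimal SIFT matching baseline (a sketch, not the paper's evaluation code).
    # Assumes OpenCV >= 4.4, where SIFT is in the main opencv-python package.
    import cv2

    def sift_match(img_a, img_b, ratio=0.75):
        """Match two grayscale image patches with SIFT + Lowe's ratio test."""
        sift = cv2.SIFT_create()
        kp_a, des_a = sift.detectAndCompute(img_a, None)
        kp_b, des_b = sift.detectAndCompute(img_b, None)
        if des_a is None or des_b is None:
            return [], []
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        knn = matcher.knnMatch(des_a, des_b, k=2)
        good = [m for m, n in knn if m.distance < ratio * n.distance]
        # Return matched pixel coordinates for downstream epipolar/RPC fitting.
        pts_a = [kp_a[m.queryIdx].pt for m in good]
        pts_b = [kp_b[m.trainIdx].pt for m in good]
        return pts_a, pts_b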
Abstract: Modeling outdoor scenes for a synthetic 3D environment requires the recovery of reflectance/albedo information from raw images, which is an ill-posed problem due to the complicated unmodeled physics in this process (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in a practical context. The recovered albedo can facilitate model relighting and shading, which can further enhance the realism of rendered models and the applications of digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently embeds unwanted lighting artifacts (at the time of capture) into the texture. These polluted textures are therefore suboptimal for a synthetic environment that aims at realistic rendering. In addition, the embedded environmental lighting compromises photo-consistency across different images, causing image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illumination, and derives the inverse model to resolve the albedo information through inverse rendering (intrinsic image decomposition). Our approach builds on the fact that both the sun illumination and the scene geometry are estimable in aerial photogrammetry, and thus can provide direct inputs for this ill-posed problem. This physics-based approach requires no input beyond the data acquired through a typical drone-based photogrammetric collection, and is shown to favorably outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image processing tasks in photogrammetry, such as feature and dense matching, and edge and line extraction.
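To make the inversion concrete, a common sun-plus-sky formation model consistent with this description (our notation and simplification; the paper's exact model may differ) is

    I(\mathbf{x}) = A(\mathbf{x}) \left[ E_{\mathrm{sun}} \max\bigl(0, \mathbf{n}(\mathbf{x}) \cdot \mathbf{l}\bigr) S(\mathbf{x}) + E_{\mathrm{sky}} V(\mathbf{x}) \right]

where A(x) is the albedo, n(x) the surface normal from the photogrammetric geometry, l the sun direction from solar ephemeris, S(x) a cast-shadow indicator, V(x) the sky visibility (ambient occlusion), and E_sun, E_sky the sun and sky irradiance. Since every term except A(x) is estimable from the reconstruction and capture metadata, the albedo follows by dividing the observed intensity I(x) by the bracketed shading term.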
Abstract: Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground configurations with very sparse view overlap, pose an even greater challenge to correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry cues in addition to color cues. This helps fill gaps when there is less overlap in large-scale scenarios. Our method formulates geometric verification as an optimization problem, guiding feature matching within detector-free methods and using sparse correspondences from detector-based methods as anchor points. By enforcing geometric constraints via the Sampson distance, our approach ensures that the denser correspondences from detector-free methods are geometrically consistent and more accurate. This hybrid strategy significantly improves correspondence density and accuracy, mitigates multi-view inconsistencies, and leads to notable advancements in camera pose accuracy and point cloud density. It outperforms state-of-the-art feature matching methods on benchmark datasets and enables feature matching in challenging, extreme large-scale settings.
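The Sampson distance used for geometric verification is a standard first-order approximation of the reprojection error under a fundamental matrix F; a minimal sketch (our variable names, not the paper's code) is:

    # Sampson distance of a putative correspondence under fundamental matrix F.
    import numpy as np

    def sampson_distance(F, x1, x2):
        """First-order geometric error of the match x1 <-> x2 (pixel coords)."""
        p1 = np.array([x1[0], x1[1], 1.0])
        p2 = np.array([x2[0], x2[1], 1.0])
        Fp1 = F @ p1          # epipolar line of x1 in image 2
        Ftp2 = F.T @ p2       # epipolar line of x2 in image 1
        num = (p2 @ F @ p1) ** 2
        den = Fp1[0]**2 + Fp1[1]**2 + Ftp2[0]**2 + Ftp2[1]**2
        return num / den

Thresholding this residual (or penalizing it in an optimization objective) keeps dense detector-free matches consistent with the epipolar geometry anchored by the sparse detector-based correspondences.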
Abstract: Predicting realistic ground views from satellite imagery in urban scenes is a challenging task due to the significant view gap between satellite and ground-view images. We propose a novel pipeline to tackle this challenge by generating geospecific views that maximally respect the weak geometry and texture from multi-view satellite images. Different from existing approaches that hallucinate images from cues such as partial semantics or geometry extracted from overhead satellite images, our method directly predicts ground-view images at the target geolocation using a comprehensive set of information from the satellite image, resulting in ground-level images with a resolution boost by a factor of ten or more. We leverage a novel building refinement method to reduce geometric distortions in satellite data at ground level, which ensures the creation of accurate conditions for view synthesis using diffusion networks. Moreover, we propose a novel geospecific prior, which prompts the distribution learning of diffusion models to respect image samples that are closer to the geolocation of the predicted images. We demonstrate that our pipeline is the first to generate close-to-real and geospecific ground views based merely on satellite images.
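One simple way such a geospecific prior could be realized is by weighting training samples by their distance to the target geolocation, so that nearby imagery dominates the learned distribution. The sketch below is a hypothetical illustration of that idea (the Gaussian weighting and sigma are our assumptions, not the paper's formulation):

    # Hypothetical geolocation-based sample weighting for a "geospecific prior".
    import numpy as np

    def geospecific_weights(sample_latlon, target_latlon, sigma_km=5.0):
        """Gaussian weights over great-circle distances (km) to the target site.
        sample_latlon: (N, 2) array of degrees; target_latlon: (lat, lon) tuple."""
        lat1, lon1 = np.radians(sample_latlon).T
        lat2, lon2 = np.radians(target_latlon)
        # Haversine great-circle distance
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
        d_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))
        w = np.exp(-0.5 * (d_km / sigma_km)**2)
        return w / w.sum()  # normalized sampling probabilities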
Abstract: Remotely sensed image high-accuracy interpretation (RSIHI), including tasks such as semantic segmentation and change detection, faces three major problems: (1) the complementarity problem between spatially stationary and non-stationary frequency features; (2) the edge uncertainty problem caused by down-sampling in the encoder and by intrinsic edge noise; and (3) the false detection problem caused by image registration error in change detection. To solve these problems, an uncertainty-diffusion-model-based high-frequency transformer network (UDHF2-Net) is proposed for RSIHI, with the following contributions: (1) a spatially-stationary-and-non-stationary high-frequency connection paradigm (SHCP) is proposed to enhance the interaction of spatially stationary and non-stationary frequency features and yield high-fidelity edge extraction. Inspired by HRFormer, SHCP retains a high-frequency stream through the whole encoder-decoder process, with parallel high-to-low-frequency streams, and reduces the edge loss caused by down-sampling; (2) a mask-and-geo-knowledge-based uncertainty diffusion module (MUDM) is proposed to improve robustness and resistance to edge noise. MUDM further optimizes uncertain regions to improve the edge extraction result by gradually removing multiple geo-knowledge-based noises; (3) a semi-pseudo-Siamese UDHF2-Net is proposed for the change detection task to reduce pseudo-changes caused by registration error. It adopts a semi-pseudo-Siamese architecture to extract the complementary frequency features above for adaptively reducing registration differences, and uses MUDM to recover uncertain regions by gradually reducing the registration error in addition to the edge noises above. Comprehensive experiments demonstrate the superiority of UDHF2-Net; in particular, ablation experiments confirm the effectiveness of its components.
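The uncertainty-refinement idea can be illustrated very roughly as below. This is our own drastic simplification of the diffusion-style iterative refinement the abstract describes; the actual MUDM is a learned module with geo-knowledge-based noise schedules, not this thresholding loop:

    # Illustrative sketch only: iteratively re-estimate labels inside the
    # high-uncertainty band around edges (stand-in for learned denoising steps).
    import numpy as np

    def refine_uncertain(prob, refine_step, steps=10, band=(0.3, 0.7)):
        """prob: (H, W) foreground probabilities; refine_step: a callable
        (prob, mask) -> prob that re-estimates values inside the mask."""
        for _ in range(steps):
            uncertain = (prob > band[0]) & (prob < band[1])  # uncertain edge band
            if not uncertain.any():
                break
            prob = np.where(uncertain, refine_step(prob, uncertain), prob)
        return prob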
Abstract: Generating wide-area digital surface models (DSMs) requires registering a large number of individual, partially overlapping DSMs. This presents a challenging problem for a typical registration algorithm, since considering the large number of observations from these multiple DSMs at once may easily cause memory overflow. Sequential registration algorithms, although they can significantly reduce computation, are especially vulnerable to pairs with small overlap, leading to large error accumulation. In this work, we propose a novel solution that formulates the DSM registration task as a motion averaging problem: pair-wise DSMs are registered to build a scene graph, with edges representing relative poses between DSMs. Specifically, based on the grid structure of the large DSM, the pair-wise registration is performed using a novel nearest-neighbor search method. We show that the scene graph can be optimized via an extremely fast motion averaging algorithm with O(N) complexity (where N is the number of images). Evaluation on high-resolution satellite-derived DSMs demonstrates significant improvements in computation and accuracy.
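To illustrate the motion averaging formulation, the sketch below reduces each edge of the scene graph to a measured 3D translation and solves for globally consistent per-DSM shifts in one least-squares pass. This is a simplified, dense illustration under our own assumptions: the paper's formulation may include rotations, and its O(N) solver is far more efficient than the generic lstsq used here.

    # Motion averaging over a scene graph, reduced to 3D translations:
    # minimize sum ||t_j - t_i - t_ij||^2, with the gauge fixed by t_0 = 0.
    import numpy as np

    def average_translations(n, edges):
        """edges: list of (i, j, t_ij), t_ij = measured shift of DSM j w.r.t. DSM i."""
        A = np.zeros((3 * len(edges) + 3, 3 * n))
        b = np.zeros(3 * len(edges) + 3)
        for k, (i, j, t_ij) in enumerate(edges):
            A[3*k:3*k+3, 3*j:3*j+3] = np.eye(3)
            A[3*k:3*k+3, 3*i:3*i+3] = -np.eye(3)
            b[3*k:3*k+3] = t_ij
        A[-3:, 0:3] = np.eye(3)  # gauge constraint: pin the first DSM at the origin
        t, *_ = np.linalg.lstsq(A, b, rcond=None)
        return t.reshape(n, 3)   # one corrective shift per DSM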
Abstract: Image fusion in Remote Sensing (RS) has been in consistent demand due to its ability to turn raw images of different resolutions, sources, and modalities into accurate, complete, and spatio-temporally coherent images. It greatly facilitates downstream applications such as pan-sharpening, change detection, land-cover classification, etc. Yet, image fusion solutions are highly disparate across various remote sensing problems, and are thus often narrowly defined in existing reviews as topical applications, such as pan-sharpening and spatial-temporal image fusion. Considering that image fusion can, in theory, be applied to any gridded data through pixel-level operations, in this paper we expand its scope by comprehensively surveying relevant works under a simple taxonomy: 1) many-to-one image fusion; 2) many-to-many image fusion. This taxonomy defines image fusion as a mapping problem that turns either a single or a set of images into another single or set of images, depending on the desired coherence, e.g., spectral or spatial/resolution coherence. We show that this simple taxonomy, despite the significant modality differences it covers, can be presented within a conceptually simple framework. In addition, we provide a meta-analysis reviewing the major papers studying the various types of image fusion and their applications over the years (from the 1980s to date), covering 5,926 peer-reviewed papers. Finally, we discuss the main benefits and emerging challenges to provide open research directions and potential future works.
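In its simplest pixel-level form, many-to-one fusion is just a per-pixel weighted mapping from a stack of co-registered inputs to a single output; the toy sketch below illustrates that "mapping" view and is not any specific method the survey reviews:

    # Toy many-to-one fusion: per-pixel weighted combination of co-registered images.
    import numpy as np

    def fuse_many_to_one(stack, weights):
        """stack: (K, H, W) co-registered images; weights: (K, H, W) per-pixel
        quality weights (e.g., inverse cloud scores). Returns an (H, W) image."""
        w = np.asarray(weights, dtype=float)
        return (w * stack).sum(axis=0) / np.clip(w.sum(axis=0), 1e-12, None)

For instance, feeding a time series of scenes with cloud-mask-derived weights yields a simple cloud-free composite, one concrete member of the many-to-one family.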
Abstract: Neural Radiance Fields (NeRF) offer the potential to benefit 3D reconstruction tasks, including aerial photogrammetry. However, the scalability and accuracy of the inferred geometry are not well-documented for large-scale aerial assets, since such datasets usually result in very high memory consumption and slow convergence. In this paper, we aim to scale NeRF to large-scale aerial datasets and provide a thorough geometry assessment of NeRF. Specifically, we introduce a location-specific sampling technique as well as a multi-camera tiling (MCT) strategy to reduce memory consumption during image loading (RAM) and representation training (GPU memory), and to increase the convergence rate within tiles. MCT decomposes a large-frame image into multiple tiled images with different camera models, allowing these small-frame images to be fed into the training process as needed for specific locations without loss of accuracy. We implement our method on a representative approach, Mip-NeRF, and compare its geometry performance with three photogrammetric MVS pipelines on two typical aerial datasets against LiDAR reference data. Both qualitative and quantitative results suggest that the proposed NeRF approach produces better completeness and object details than traditional approaches, although, as of now, it still falls short in terms of accuracy.
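The core of such a tiling strategy is that cropping a frame only shifts the principal point of a pinhole camera, so each tile gets valid intrinsics of its own. A minimal sketch of that decomposition (tile size and the absence of overlap are our assumptions, not the paper's settings):

    # Sketch of multi-camera tiling (MCT): split one large frame into tiles,
    # each paired with shifted pinhole intrinsics for independent training.
    import numpy as np

    def tile_camera(K, image, tile=1024):
        """Yield (K_tile, crop) pairs; K is the 3x3 intrinsics of the full frame."""
        H, W = image.shape[:2]
        for y0 in range(0, H, tile):
            for x0 in range(0, W, tile):
                K_t = K.copy()
                K_t[0, 2] -= x0   # shift principal point cx into tile coordinates
                K_t[1, 2] -= y0   # shift principal point cy into tile coordinates
                yield K_t, image[y0:y0 + tile, x0:x0 + tile]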
Abstract: Recent deep learning-based methods outperform traditional learning methods on remote sensing (RS) semantic segmentation/classification tasks. However, they require large training datasets and are generally known for their lack of transferability, due to the highly disparate RS image content across different geographical regions. Yet, there is no comprehensive analysis of their transferability, i.e., to what extent a model trained on a source domain can be readily applied to a target domain. Therefore, in this paper, we aim to investigate the raw transferability of traditional and deep learning (DL) models, as well as the effectiveness of domain adaptation (DA) approaches in enhancing the transferability of DL models (adapted transferability). Utilizing four highly diverse RS datasets, we train six models with and without three DA approaches to quantitatively analyze their transferability between these datasets. Furthermore, we develop a straightforward method that quantifies the transferability of a model using spectral indices as a medium, and demonstrate its effectiveness in evaluating model transferability on the target domain when labels are unavailable. Our experiments yield several generally important yet not well-reported observations regarding raw and adapted transferability. Moreover, our proposed label-free transferability assessment method is shown to outperform posterior model confidence. The findings can guide the future development of generalized RS learning models. The trained models are released at: https://github.com/GDAOSU/Transferability-Remote-Sensing
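One plausible instantiation of the label-free idea is to use a spectral index (NDVI here) as the medium and compare its per-class distributions between source and target predictions; the higher the overlap, the more transferable the model. The histogram and overlap choices below are our assumptions for illustration, not the paper's exact metric:

    # Sketch of a label-free transferability score via a spectral-index medium.
    import numpy as np

    def ndvi(red, nir):
        return (nir - red) / np.clip(nir + red, 1e-6, None)

    def transfer_score(ndvi_src, pred_src, ndvi_tgt, pred_tgt, n_classes, bins=32):
        """Mean per-class NDVI histogram overlap; higher = more transferable."""
        edges = np.linspace(-1, 1, bins + 1)
        scores = []
        for c in range(n_classes):
            a, b = ndvi_src[pred_src == c], ndvi_tgt[pred_tgt == c]
            if a.size == 0 or b.size == 0:
                continue  # class absent in one domain; skip it
            h_s, _ = np.histogram(a, bins=edges, density=True)
            h_t, _ = np.histogram(b, bins=edges, density=True)
            scores.append(np.minimum(h_s, h_t).sum() * (2.0 / bins))  # overlap in [0, 1]
        return float(np.mean(scores)) if scores else float("nan")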