Abstract:As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate mapping of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar panels from aerial imagery, offering insights into their location and size. However, solar panel identification in Computer Vision is intricate due to various factors like weather conditions, roof conditions, and Ground Sampling Distance (GSD) variations. To tackle these complexities, we present the SolarFormer, featuring a multi-scale Transformer encoder and a masked-attention Transformer decoder. Our model leverages low-level features and incorporates an instance query mechanism to enhance the localization of solar PV installations. We rigorously evaluated our SolarFormer using diverse datasets, including GGE (France), IGN (France), and USGS (California, USA), across different GSDs. Our extensive experiments consistently demonstrate that our model either matches or surpasses state-of-the-art models, promising enhanced solar panel segmentation for global sustainable energy initiatives.
Abstract:Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
Abstract:Aerial Image Segmentation is a top-down perspective semantic segmentation and has several challenging characteristics such as strong imbalance in the foreground-background distribution, complex background, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers at the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path. Our AerialFormer is designed as a hierarchical structure, in which Transformer encoder outputs multi-scale features and MD-CNNs decoder aggregates information from the multi-scales. Thus, it takes both local and global contexts into consideration to render powerful representations and high-resolution segmentation. We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that our proposed AerialFormer outperforms previous state-of-the-art methods with remarkable performance. Our source code will be publicly available upon acceptance.