Abstract: Semi-supervised semantic segmentation, which alleviates the cost of acquiring dense annotations, is essential for 3D scene understanding. Most methods leverage a teacher model to generate pseudo labels that guide the learning of a student model on unlabeled scenes. However, they focus only on points with pseudo labels while overlooking points without them, which causes intra-scene inconsistency and leads to semantic ambiguity. Moreover, inter-scene correlation between labeled and unlabeled scenes contributes to transferring rich annotation information, yet it has not been explored for semi-supervised tasks. To address these two problems, we propose to explore scene coherence for semi-supervised 3D semantic segmentation, dubbed CoScene. Inspired by the unstructured and unordered nature of point clouds, CoScene adopts a straightforward point erasure strategy to ensure intra-scene consistency. Moreover, patch-based data augmentation is proposed to enhance inter-scene information transfer between labeled and unlabeled scenes at both the scene and instance levels. Extensive experimental results on SemanticKITTI and nuScenes show that our approach outperforms existing methods.
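The abstract does not spell out how point erasure enforces intra-scene consistency, so the following is only a minimal sketch of one plausible instantiation in PyTorch. The segmentation model `seg_model`, the erase ratio, and the KL-based consistency term are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def point_erasure_consistency(seg_model, points, erase_ratio=0.1):
    """Sketch of a point-erasure consistency term (details assumed).

    points: (N, C) unlabeled point cloud; seg_model maps it to per-point logits.
    A random subset of points is erased, and predictions on the remaining
    points are encouraged to agree with predictions made on the full scene.
    """
    n = points.shape[0]
    keep_mask = torch.rand(n, device=points.device) > erase_ratio  # points to keep

    with torch.no_grad():
        full_logits = seg_model(points)            # (N, K) logits on the full scene
    erased_logits = seg_model(points[keep_mask])   # (M, K) logits on the erased scene

    # KL divergence between the two predictive distributions on the kept points
    return F.kl_div(
        F.log_softmax(erased_logits, dim=-1),
        F.softmax(full_logits[keep_mask], dim=-1),
        reduction="batchmean",
    )
```

In a teacher-student setup, `full_logits` would typically come from the teacher on the full scene while the student processes the erased scene; the single-model variant above is just the simplest form of the idea.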
Abstract: Generic large Vision-Language Models (VLMs) are developing rapidly but still perform poorly in the Remote Sensing (RS) domain, owing to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing-specific Vision-Language Models (RSVLMs) still have considerable room for improvement, primarily because of the lack of large-scale, high-quality RS vision-language datasets. We construct HqDC-1.4M, a large-scale dataset of High-quality and Detailed Captions for RS images containing 1.4 million image-caption pairs, which not only enhances the RSVLM's understanding of RS images but also significantly improves its spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we develop RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness of the model's outputs and reduces hallucinations, thereby enhancing the honesty of the RSVLM. Based on these datasets, we propose H2RSVLM, the Helpful and Honest Remote Sensing Vision-Language Model. H2RSVLM achieves outstanding performance on multiple public RS datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data, and model weights at https://github.com/opendatalab/H2RSVLM .
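The abstract does not describe the RSSA data format, so the snippet below is only an illustrative sketch of how answerable and unanswerable RS VQA items might be mixed; the field names, image paths, and answer texts are hypothetical, not the released schema.

```python
# Hypothetical RSSA-style records (field names and contents are assumptions).
# The key idea is pairing ordinary VQA items with questions that cannot be
# answered from the image, so the model learns to refuse rather than hallucinate.
rssa_examples = [
    {
        "image": "rs_images/airport_0421.tif",
        "question": "How many airplanes are parked on the apron?",
        "answer": "There are six airplanes parked on the apron.",
        "answerable": True,
    },
    {
        "image": "rs_images/farmland_1130.tif",
        "question": "What color is the ship in the harbor?",
        # No harbor or ship is present, so the target output is a refusal.
        "answer": "This question cannot be answered from the image; no ship is visible.",
        "answerable": False,
    },
]
```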
Abstract: Advanced change detection techniques primarily target image pairs of equally high quality. However, variations in imaging conditions and platforms frequently produce image pairs of distinct quality: one image is high-quality while the other is low-quality. These disparities in image quality pose significant challenges for semantically understanding the image pair and extracting change features, ultimately resulting in a notable decline in performance. To tackle this challenge, we introduce an innovative training strategy grounded in knowledge distillation. The core idea is to leverage task knowledge acquired from high-quality image pairs to guide the model's learning when dealing with image pairs of differing quality. Additionally, we develop a hierarchical correlation distillation approach involving self-correlation, cross-correlation, and global correlation. It compels the student model to replicate the correlations inherent in the teacher model, rather than focusing solely on individual features, ensuring effective knowledge transfer while preserving the student model's training flexibility.
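As a rough illustration of correlation-based distillation (not the paper's exact formulation), the sketch below computes spatial self-correlation within each temporal feature map and cross-correlation between the bi-temporal feature maps, then matches the student's correlation matrices to the teacher's. The global-correlation term, any loss weighting, and the feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def self_correlation(feat):
    """feat: (B, C, H, W) -> (B, HW, HW) cosine similarity between spatial positions."""
    f = F.normalize(feat.flatten(2), dim=1)      # (B, C, HW), unit-norm feature vectors
    return torch.bmm(f.transpose(1, 2), f)

def cross_correlation(feat_t1, feat_t2):
    """Correlation between bi-temporal features of the same scene (same spatial size)."""
    f1 = F.normalize(feat_t1.flatten(2), dim=1)
    f2 = F.normalize(feat_t2.flatten(2), dim=1)
    return torch.bmm(f1.transpose(1, 2), f2)

def correlation_distill_loss(teacher_feats, student_feats):
    """teacher_feats / student_feats: (feat_t1, feat_t2) pairs of (B, C, H, W) tensors.
    The student is trained to reproduce the teacher's correlation structure
    rather than its raw features."""
    t1, t2 = (f.detach() for f in teacher_feats)   # teacher provides fixed targets
    s1, s2 = student_feats
    loss = F.mse_loss(self_correlation(s1), self_correlation(t1))
    loss += F.mse_loss(self_correlation(s2), self_correlation(t2))
    loss += F.mse_loss(cross_correlation(s1, s2), cross_correlation(t1, t2))
    return loss
```

Matching correlation matrices rather than raw features is what leaves the student free to organize its own feature space while still inheriting the teacher's relational structure.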
Abstract: The detection of flooded areas using high-resolution synthetic aperture radar (SAR) imagery is a critical task with applications in crisis and disaster management, as well as environmental resource planning. However, the complex nature of SAR images presents a challenge that often leads to overestimation of the flood extent. To address this issue, we propose a novel differential attention metric-based network (DAM-Net). DAM-Net comprises two key components: a weight-sharing Siamese backbone that obtains multi-scale change features of multi-temporal images together with tokens carrying high-level semantic information about water-body changes, and a temporal differential fusion (TDF) module that integrates the semantic tokens and change features to generate flood maps with reduced speckle noise. Specifically, the backbone is split into multiple stages, and in each stage we design three modules, namely temporal-wise feature extraction (TWFE), cross-temporal change attention (CTCA), and temporal-aware change enhancement (TACE), to effectively extract change features. In the TACE of the last stage, we introduce a class token that records high-level semantic information about water-body changes via the attention mechanism. Another challenge faced by data-driven deep learning algorithms is the limited availability of flood detection datasets. To overcome this, we have created S1GFloods, an open-source, global-scale, high-resolution dataset of Sentinel-1 SAR image pairs covering 46 flood events between 2015 and 2022. Experiments on S1GFloods show that the proposed DAM-Net achieves top results compared with state-of-the-art methods in terms of overall accuracy, F1-score, and IoU, reaching 97.8%, 96.5%, and 93.2%, respectively. Our dataset and code will be available online at https://github.com/Tamer-Saleh/S1GFlood-Detection.
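For intuition only, here is a minimal sketch of the weight-sharing Siamese idea, with a simple absolute feature difference standing in for the full TDF module; the backbone, feature dimensions, and fusion choice are assumptions, and the semantic class token and attention-based fusion described in the abstract are not modeled.

```python
import torch
import torch.nn as nn

class SiameseChangeNet(nn.Module):
    """Sketch of a weight-sharing Siamese change network (details assumed).

    One backbone processes both SAR acquisitions; the absolute difference of
    their features is a simplified stand-in for DAM-Net's TDF module, which
    additionally integrates a semantic class token via attention.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone                      # shared weights for both dates
        self.fuse = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, img_t1, img_t2):
        f1 = self.backbone(img_t1)                    # features of pre-event image
        f2 = self.backbone(img_t2)                    # same weights, post-event image
        diff = torch.abs(f1 - f2)                     # temporal differential features
        return self.classifier(self.fuse(diff))       # per-pixel flood-change logits
```

Weight sharing guarantees that both acquisitions are embedded in the same feature space, so the temporal difference reflects scene change rather than differences between two separately trained encoders.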