Abstract: Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Unlike typical models that require multi-stage fusion in the transformer encoder and the fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
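The abstract names a cosine similarity-based fusion module but does not spell out its computation. The following PyTorch snippet is a minimal sketch of one plausible reading: bi-temporal token features are compared with cosine similarity, the dissimilarity is used to weight their difference, and a single projection produces the fused features. The class name, the projection layer, and all tensor shapes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineDifferenceFusion(nn.Module):
    """Hypothetical single-stage fusion of bi-temporal features via cosine similarity."""
    def __init__(self, dim: int):
        super().__init__()
        # Assumed projection mapping concatenated features back to the model dimension.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1, feat_t2: (B, N, C) token features from the two acquisition dates.
        sim = F.cosine_similarity(feat_t1, feat_t2, dim=-1, eps=1e-6)  # (B, N), high where unchanged
        change_weight = (1.0 - sim).unsqueeze(-1)                      # emphasize changed regions
        diff = (feat_t2 - feat_t1) * change_weight                     # similarity-guided difference
        fused = self.proj(torch.cat([feat_t1 + feat_t2, diff], dim=-1))
        return fused  # (B, N, C), passed on to the caption decoder

# Usage sketch with arbitrary shapes.
if __name__ == "__main__":
    f1, f2 = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
    print(CosineDifferenceFusion(256)(f1, f2).shape)  # torch.Size([2, 196, 256])
```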
Abstract: Recent advances in Remote Sensing (RS) for Change Detection (CD) and Change Captioning (CC) have achieved substantial success by adopting deep learning techniques. Despite these advances, existing methods often handle CD and CC tasks independently, leading to inefficiencies from the absence of synergistic processing. In this paper, we present ChangeMinds, a novel unified multi-task framework that concurrently optimizes CD and CC within a single, end-to-end model. We propose a change-aware long short-term memory module (ChangeLSTM) to effectively capture complex spatiotemporal dynamics from extracted bi-temporal deep features, enabling the generation of universal change-aware representations that serve both CC and CD tasks. Furthermore, we introduce a multi-task predictor with a cross-attention mechanism that enhances the interaction between image and text features, promoting efficient simultaneous learning for both tasks. Extensive evaluations on the LEVIR-MCI dataset, alongside other standard benchmarks, show that ChangeMinds surpasses existing methods in multi-task learning settings and markedly improves performance on individual CD and CC tasks. The code and pre-trained models will be available online.
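The multi-task predictor is described only at a high level. As a rough illustration under stated assumptions, the sketch below lets text tokens cross-attend to change-aware image features and branches into a captioning head and a change-mask head; layer sizes, head names, and the reshape into a spatial map are all assumptions for illustration rather than the ChangeMinds implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionPredictor(nn.Module):
    """Hypothetical multi-task head: shared cross-attention, separate CC and CD branches."""
    def __init__(self, dim: int, num_heads: int = 8, vocab_size: int = 10000, num_classes: int = 2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.caption_head = nn.Linear(dim, vocab_size)   # CC branch: per-token vocabulary logits
        self.mask_head = nn.Conv2d(dim, num_classes, 1)  # CD branch: per-pixel change logits

    def forward(self, text_tokens: torch.Tensor, image_feats: torch.Tensor, h: int, w: int):
        # text_tokens: (B, T, C); image_feats: (B, N, C) with N = h * w.
        attended, _ = self.cross_attn(text_tokens, image_feats, image_feats)
        caption_logits = self.caption_head(attended)                 # (B, T, vocab_size)
        b, n, c = image_feats.shape
        spatial = image_feats.transpose(1, 2).reshape(b, c, h, w)    # tokens back to a feature map
        change_logits = self.mask_head(spatial)                      # (B, num_classes, h, w)
        return caption_logits, change_logits

# Usage sketch with arbitrary shapes.
if __name__ == "__main__":
    head = CrossAttentionPredictor(dim=256)
    cap, mask = head(torch.randn(2, 20, 256), torch.randn(2, 14 * 14, 256), 14, 14)
    print(cap.shape, mask.shape)  # (2, 20, 10000) and (2, 2, 14, 14)
```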
Abstract: In recent years, with the rapid advancement of transformer models, transformer-based multimodal architectures have found wide application in various downstream tasks, including but not limited to Image Captioning, Visual Question Answering (VQA), and Image-Text Generation. However, contemporary approaches to Remote Sensing (RS) VQA often rely on resource-intensive techniques, such as full fine-tuning of large models or extracting image-text features from pre-trained multimodal models followed by modality fusion with decoders. These approaches demand significant computational resources and time and introduce a considerable number of trainable parameters. To address these challenges, we introduce RSAdapter, a novel method that prioritizes runtime and parameter efficiency. RSAdapter comprises two key components: a Parallel Adapter and an additional linear transformation layer inserted after each fully connected (FC) layer within the Adapter. This design not only improves adaptation to pre-trained multimodal models but also allows the parameters of the linear transformation layer to be merged into the preceding FC layer during inference, reducing inference costs. To demonstrate the effectiveness of RSAdapter, we conduct an extensive series of experiments on three distinct RS-VQA datasets and achieve state-of-the-art results on all three. The code for RSAdapter will be available online at https://github.com/Y-D-Wang/RSAdapter.
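The merging property follows from the fact that two stacked linear maps compose into one: y = W_e(W_f x + b_f) + b_e = (W_e W_f) x + (W_e b_f + b_e). The snippet below is a simplified reading of that idea, not the official RSAdapter code (available at the repository above); the class name and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MergeableLinear(nn.Module):
    """Hypothetical FC layer with a trailing linear transform that folds away at inference."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.extra = nn.Linear(out_dim, out_dim)  # trained jointly, absorbed after merging

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.extra(self.fc(x))

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # (W_e W_f) x + (W_e b_f + b_e): a single linear layer, so no extra inference cost.
        merged = nn.Linear(self.fc.in_features, self.extra.out_features)
        merged.weight.copy_(self.extra.weight @ self.fc.weight)
        merged.bias.copy_(self.extra.weight @ self.fc.bias + self.extra.bias)
        return merged

# Sanity check: the merged layer reproduces the two-layer forward pass.
layer = MergeableLinear(64, 64)
x = torch.randn(4, 64)
assert torch.allclose(layer(x), layer.merge()(x), atol=1e-5)
```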