Abstract: Graph Contrastive Learning (GCL) has emerged as the foremost approach for self-supervised learning on graph-structured data. GCL reduces reliance on labeled data by learning robust representations from various augmented views. However, existing GCL methods typically depend on consistent stochastic augmentations, which overlook their impact on the intrinsic structure of the spectral domain and thereby limit the model's ability to generalize. To address these limitations, we propose a novel paradigm called AS-GCL that incorporates asymmetric spectral augmentation into graph contrastive learning. A typical GCL framework consists of three key components: graph data augmentation, view encoding, and the contrastive loss. Our method introduces significant enhancements to each of these components. Specifically, for data augmentation, we apply spectral-based augmentation to minimize spectral variation, strengthen structural invariance, and reduce noise. For encoding, we employ parameter-sharing encoders with distinct diffusion operators to generate diverse, noise-resistant graph views. For the contrastive loss, we introduce an upper-bound loss function that promotes generalization by maintaining a balanced distribution of intra- and inter-class distances. To our knowledge, we are the first to encode spectral-domain augmentation views with asymmetric encoders. Extensive experiments on eight benchmark datasets across various node-level tasks demonstrate the advantages of the proposed method.
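To make the contrastive component concrete, below is a minimal sketch of a standard two-view InfoNCE objective of the kind GCL frameworks build on. It is illustrative only: the paper's spectral augmentation, asymmetric encoders, and upper-bound loss are not reproduced, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE-style loss between two augmented views.

    z1, z2: (N, d) node embeddings from the two views; row i of each
    view corresponds to the same node, forming the positive pair.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau              # (N, N) cross-view similarities
    targets = torch.arange(z1.size(0))   # positives sit on the diagonal
    # Symmetrize so both views anchor the loss.
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))
```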
Abstract: Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal context for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and updates it at each timestep, allowing for precise object mapping and structured navigation information. We then enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and yielding our ASM. The MapNav agent uses the constructed ASM as input and exploits the powerful end-to-end capabilities of vision-language models (VLMs) to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that MapNav can serve as a new memory representation for VLN, paving the way for future research in this field.
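As a rough illustration of the map-as-memory idea, here is a toy top-down semantic grid that is updated per timestep and exposes textual region labels. The class set, grid size, and projection are all hypothetical; the paper's actual ASM construction is not specified here.

```python
import numpy as np

# Hypothetical class ids; the paper's real label set may differ.
CLASSES = {0: "free", 1: "wall", 2: "sofa", 3: "table"}

class AnnotatedSemanticMap:
    """Toy top-down semantic map: one grid refreshed each timestep,
    plus textual labels at the centroid of each labeled region."""

    def __init__(self, size=128):
        self.grid = np.zeros((size, size), dtype=np.int64)

    def update(self, cell_xy, class_id):
        """Write one observed cell (already projected to map coords)."""
        x, y = cell_xy
        self.grid[y, x] = class_id

    def annotations(self):
        """Return (label, centroid) pairs: the explicit textual cues."""
        out = []
        for cid, name in CLASSES.items():
            ys, xs = np.nonzero(self.grid == cid)
            if cid != 0 and len(xs):
                out.append((name, (xs.mean(), ys.mean())))
        return out
```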
Abstract: The map construction task plays a vital role in providing the precise and comprehensive static environmental information essential for autonomous driving systems. Primary sensors include cameras and LiDAR, with configurations varying among camera-only, LiDAR-only, and camera-LiDAR fusion, depending on cost-performance trade-offs. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird's-Eye View (BEV) feature fusion method for map construction. Specifically, to resolve the semantic misalignment between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, which enables interaction between the two BEV feature spaces and enhances feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module to adaptively select valuable information from the different modalities, taking full advantage of their complementary information. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, High-definition (HD) map construction and BEV map segmentation, to show its versatility and effectiveness. Compared with state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.
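A minimal sketch of a CIT-style interaction step, under the assumption that both BEV features are token sequences mixed by joint self-attention; channel width, head count, and the residual/norm layout are illustrative, not the authors' design.

```python
import torch
import torch.nn as nn

class CITBlock(nn.Module):
    """Joint self-attention over concatenated camera and LiDAR BEV tokens,
    so each modality can attend to the other before being split back."""

    def __init__(self, c=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.norm = nn.LayerNorm(c)

    def forward(self, cam_bev, lidar_bev):           # each (B, C, H, W)
        B, C, H, W = cam_bev.shape
        cam = cam_bev.flatten(2).transpose(1, 2)     # (B, HW, C)
        lid = lidar_bev.flatten(2).transpose(1, 2)   # (B, HW, C)
        x = torch.cat([cam, lid], dim=1)             # joint token sequence
        x = self.norm(x + self.attn(x, x, x)[0])     # cross-modal mixing
        cam, lid = x.split(H * W, dim=1)
        back = lambda t: t.transpose(1, 2).reshape(B, C, H, W)
        return back(cam), back(lid)
```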
Abstract: Multi-sensor fusion models play a crucial role in autonomous driving perception, particularly in tasks such as 3D object detection and HD map construction, where they provide essential and comprehensive static environmental information. While camera-LiDAR fusion methods have shown promising results by integrating data from both modalities, they often depend on complete sensor inputs. This reliance can lead to low robustness and potential failures when sensors are corrupted or missing, raising significant safety concerns. To tackle this challenge, we introduce the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark for evaluating the robustness of multi-sensor autonomous driving perception models against various sensor corruptions. Our benchmark includes 16 combinations of corruption types that disrupt camera and LiDAR inputs, either individually or concurrently. Extensive evaluations of six 3D object detection models and four HD map construction models reveal substantial performance degradation under adverse weather conditions and sensor failures, underscoring critical safety issues. The benchmark toolkit, along with the affiliated code and model checkpoints, has been made publicly available.
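For intuition, a corruption grid of this kind can be enumerated as camera-LiDAR pairs, covering single-modality and concurrent failures. The corruption lists below are placeholders, not the benchmark's actual taxonomy or count.

```python
from itertools import product

# Hypothetical corruption menus for each sensor; "clean" means unperturbed.
CAMERA = ["clean", "snow", "fog", "motion_blur"]
LIDAR = ["clean", "snow", "crosstalk", "incomplete_echo"]

def corruption_grid(cam_ops, lidar_ops):
    """Pair every camera corruption with every LiDAR corruption, keeping
    cases where at least one modality is disrupted."""
    return [(c, l) for c, l in product(cam_ops, lidar_ops)
            if (c, l) != ("clean", "clean")]

for cam_c, lidar_c in corruption_grid(CAMERA, LIDAR):
    print(f"evaluate model with camera={cam_c}, lidar={lidar_c}")
```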
Abstract: Deep hashing has been widely used for large-scale approximate nearest neighbor search due to its storage and search efficiency. However, existing deep hashing methods predominantly rely on abundant training data, leaving the more challenging scenario of low-resource adaptation relatively underexplored. This setting involves adapting pre-trained models to downstream tasks with only an extremely small number of training samples available. Our preliminary benchmarks reveal that current methods suffer significant performance degradation due to the distribution shift caused by limited training samples. To address these challenges, we introduce Class-Calibration LoRA (CLoRA), a novel plug-and-play approach that dynamically constructs low-rank adaptation matrices by leveraging class-level textual knowledge embeddings. CLoRA effectively incorporates prior class knowledge as anchors, enabling parameter-efficient fine-tuning while preserving the original data distribution. Furthermore, we propose Knowledge-Guided Discrete Optimization (KIDDO), a framework that utilizes class knowledge to compensate for the scarcity of visual information and to enhance the discriminability of hash codes. Extensive experiments demonstrate that our proposed method, Knowledge-Anchored Low-Resource Adaptation Hashing (KALAHash), significantly boosts retrieval performance and achieves a 4x improvement in data efficiency in low-resource scenarios.
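A minimal sketch of the general idea of class-anchored low-rank adaptation: a LoRA layer whose down-projection is seeded from frozen class-level text embeddings while only the small up-projection is trained. This is one plausible reading, not CLoRA's actual construction; all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class ClassAnchoredLoRA(nn.Module):
    """LoRA-style adapter: W x + (alpha / r) * B (A x), where A is built
    from class-level text embeddings (frozen anchors) and B is trained.

    class_emb: (r, in_features) tensor, e.g. one text embedding per class.
    """

    def __init__(self, base: nn.Linear, class_emb: torch.Tensor, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # pre-trained weights stay frozen
        r = class_emb.size(0)
        self.A = nn.Parameter(class_emb.clone(), requires_grad=False)  # anchor
        self.B = nn.Parameter(torch.zeros(base.out_features, r))       # trained
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()
```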
Abstract: Federated graph learning (FGL) has gained significant attention for enabling heterogeneous clients to process their private graph data locally while interacting with a centralized server, thus preserving privacy. However, graph data on clients are typically non-IID, making it challenging for a single model to perform well across all clients. Another major bottleneck of FGL is the high cost of communication. To address these challenges, we propose CEFGL, a communication-efficient personalized federated graph learning algorithm. Our method decomposes the model parameters into low-rank generic and sparse private components. We employ a dual-channel encoder to learn sparse local knowledge in a personalized manner and low-rank global knowledge in a shared manner. Additionally, we perform multiple local stochastic gradient descent iterations between communication rounds and integrate efficient compression techniques into the algorithm. The advantage of CEFGL lies in its ability to capture common and individual knowledge more precisely. By using low-rank and sparse parameters together with compression, CEFGL significantly reduces communication complexity. Extensive experiments across sixteen datasets demonstrate that our method achieves the best classification accuracy in a variety of heterogeneous environments. Specifically, compared to the state-of-the-art method FedStar, the proposed method (with GIN as the base model) improves accuracy by 5.64% in the cross-dataset setting CHEM, reduces communicated bits by a factor of 18.58, and reduces communication time by a factor of 1.65.
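One way to picture the low-rank/sparse split is a truncated SVD plus a top-k sparse residual, as sketched below; CEFGL learns its two components rather than decomposing a fixed matrix, so treat this purely as intuition, with rank and sparsity as assumed hyperparameters.

```python
import torch

def low_rank_plus_sparse(W, rank, sparsity):
    """Split a weight matrix into a low-rank part (shareable across clients)
    and a sparse residual (kept private).

    sparsity: fraction in (0, 1) of residual entries to keep by magnitude.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
    resid = W - low_rank
    k = int(sparsity * resid.numel())
    # Keep the k largest-magnitude residual entries, zero the rest.
    thresh = resid.abs().flatten().kthvalue(resid.numel() - k).values
    sparse = torch.where(resid.abs() > thresh, resid, torch.zeros_like(resid))
    return low_rank, sparse
```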
Abstract: Multi-view contrastive clustering (MVCC) has gained significant attention for its ability to generate consistent clustering structures from multiple views through contrastive learning. However, most existing MVCC methods create cross-views by combining every pair of views, producing a large number of unreliable pairs. Furthermore, these approaches often overlook discrepancies among multi-view representations, resulting in representation degeneration. To address these challenges, we introduce a novel model called Dual-Weighted Contrastive Learning (DWCL) for multi-view clustering. Specifically, to reduce the impact of unreliable cross-views, we introduce an innovative Best-Other (B-O) contrastive mechanism that enhances the representation of individual views at low computational cost. Furthermore, we develop a dual weighting strategy that combines a view quality weight, reflecting the quality of each view, with a view discrepancy weight. This approach effectively mitigates representation degeneration by down-weighting cross-views that are both low in quality and high in discrepancy. We theoretically validate the efficiency of the B-O contrastive mechanism and the effectiveness of the dual weighting strategy. Extensive experiments demonstrate that DWCL outperforms previous methods across eight multi-view datasets, showcasing superior performance and robustness in MVCC. Specifically, our method achieves absolute accuracy improvements of 5.4% and 5.6% over state-of-the-art methods on the Caltech6V7 and MSRCv1 datasets, respectively.
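The sketch below illustrates the two ideas together: pair every view only with the single best view (rather than all O(V^2) pairs) and weight each pair by quality times (1 - discrepancy). How the quality and discrepancy scores are computed is the paper's contribution and is not reproduced; here they are assumed inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

def dwcl_style_loss(views, quality, discrepancy, tau=0.5):
    """views: dict name -> (N, d) embeddings; quality[v] and
    discrepancy[(best, v)] are assumed precomputed scores in [0, 1]."""
    best = max(quality, key=quality.get)      # Best-Other: one anchor view
    total, norm = 0.0, 0.0
    for v in views:
        if v == best:
            continue
        w = quality[v] * (1.0 - discrepancy[(best, v)])
        total = total + w * info_nce(views[best], views[v], tau)
        norm += w
    return total / max(norm, 1e-8)
```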
Abstract: Multi-Image Super-Resolution (MISR) is a crucial yet challenging research task in the remote sensing community. In this paper, we address Multi-Image Super-Resolution in Remote Sensing (MISR-RS), which aims to generate a High-Resolution (HR) image from multiple Low-Resolution (LR) images captured by satellites. Recently, the weak temporal correlations among LR images have attracted increasing attention in the MISR-RS task. However, existing MISR methods treat the LR images as sequences with strong temporal correlations, overlooking spatial correlations and imposing unwarranted temporal dependencies. To address this problem, we propose a novel end-to-end framework named Enhancing Spatial Correlations in MISR (ESC-MISR), which fully exploits the spatial-temporal relations among multiple images for HR image reconstruction. Specifically, we first introduce a novel fusion module named Multi-Image Spatial Transformer (MIST), which emphasizes regions with clearer global spatial features and enhances the spatial correlations between LR images. In addition, we apply a random shuffle strategy to the sequential LR inputs during training to attenuate temporal dependencies and capture the weak temporal correlations. Compared with state-of-the-art methods, ESC-MISR achieves 0.70 dB and 0.76 dB cPSNR improvements on the two bands of the PROBA-V dataset, respectively, demonstrating the superiority of our method.
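The random shuffle strategy itself is simple to picture: permute the temporal axis of the LR stack during training so the model cannot rely on a fixed revisit order. A minimal sketch, with the (B, T, C, H, W) layout as an assumption:

```python
import torch

def shuffle_lr_frames(lr_stack, training=True):
    """lr_stack: (B, T, C, H, W) stack of low-resolution revisits.
    Permuting the T axis during training discourages the model from
    assuming strong temporal order among the revisits."""
    if training:
        perm = torch.randperm(lr_stack.size(1))
        lr_stack = lr_stack[:, perm]
    return lr_stack
```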
Abstract: Continual graph learning (CGL) is an important and challenging task that aims to extend static GNNs to dynamic task-flow scenarios. As one of the mainstream CGL methods, experience replay (ER) has received widespread attention due to its superior performance. However, existing ER methods select samples by feature significance or topological relevance alone, which limits their utilization of the full graph data. In addition, topology-based ER methods consider only local topological information and add neighboring nodes to the buffer, ignoring global topological information and increasing memory overhead. To bridge these gaps, we propose Feature-Topology Fusion-based Experience Replay (FTF-ER), a novel method that effectively mitigates catastrophic forgetting with enhanced efficiency. Specifically, to maximize the utilization of the entire graph, we propose a highly complementary sampling approach that combines feature and global topological information, significantly improving the effectiveness of the sampled nodes. Moreover, to further exploit global topology, we propose the Hodge Potential Score (HPS), a novel module that computes the topological importance of nodes. HPS derives a global node ranking via Hodge decomposition on graphs, providing more accurate global topological information than neighbor sampling. By eliminating neighbor sampling, HPS significantly reduces the buffer storage cost of acquiring topological information and simultaneously decreases training time. Compared with state-of-the-art methods, FTF-ER achieves improvements of 3.6% in AA and 7.1% in AF on the OGB-Arxiv dataset, demonstrating its superior performance in the class-incremental learning setting.
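To ground the Hodge decomposition step, here is a HodgeRank-style sketch: given an edge flow encoding pairwise importance differences, a global node potential is recovered as the least-squares gradient fit. The choice of edge flow is an assumption; the paper defines its own quantity for HPS.

```python
import numpy as np

def hodge_potentials(n, edges, flow):
    """Find node scores s minimizing sum_e (s[j] - s[i] - flow[e])^2,
    i.e. the gradient (potential) component of the edge flow.

    n: number of nodes; edges: list of (i, j); flow: per-edge values.
    """
    G = np.zeros((len(edges), n))        # discrete gradient operator
    for e, (i, j) in enumerate(edges):
        G[e, i], G[e, j] = -1.0, 1.0
    s, *_ = np.linalg.lstsq(G, np.asarray(flow, dtype=float), rcond=None)
    return s - s.mean()                  # potentials are defined up to a constant
```

Sorting nodes by the returned potential yields a global ranking without touching any node's neighborhood at sampling time, which is the property the abstract credits for the reduced buffer cost.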
Abstract: Scene Graph Generation (SGG) remains a challenging task due to its compositional nature. Previous approaches improve prediction efficiency by learning in an end-to-end manner, but their performance is limited because they assume unidirectional conditioning between entities and predicates, leading to insufficient information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization for SGG that introduces efficient interaction between entities and predicates. Specifically, we develop an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement our factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling the two predictions to benefit from each other. Second, Random Feature Alignment (RFA) regularizes the feature space by distilling multi-modal knowledge from pre-trained models, enhancing BCTR's performance on tail categories without relying on statistical priors. We conduct experiments on Visual Genome and Open Images V6, demonstrating that BCTR achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance of the paper.
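One plausible shape for a bidirectional conditioning stage is a pair of cross-attention passes in which entity queries attend to predicate queries and vice versa. The sketch below is illustrative only; widths, head counts, and the number of stages in BCG are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalConditioningStage(nn.Module):
    """One stage of mutual conditioning: entities attend to predicates,
    then predicates attend to the updated entities."""

    def __init__(self, d=256, heads=8):
        super().__init__()
        self.e2p = nn.MultiheadAttention(d, heads, batch_first=True)
        self.p2e = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_e = nn.LayerNorm(d)
        self.norm_p = nn.LayerNorm(d)

    def forward(self, ent, pred):                  # (B, Ne, d), (B, Np, d)
        ent = self.norm_e(ent + self.e2p(ent, pred, pred)[0])
        pred = self.norm_p(pred + self.p2e(pred, ent, ent)[0])
        return ent, pred
```

Stacking several such stages would give the multi-stage interactive feature augmentation the abstract describes, with each prediction repeatedly conditioned on the other.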