Abstract:Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.
Abstract:Change Detection (CD) enables the identification of alterations between images of the same area captured at different times. However, existing CD methods still struggle to address pseudo changes resulting from domain information differences in multi-temporal images and instances of detail errors caused by the loss and contamination of detail features during the upsampling process in the network. To address this, we propose a bi-temporal Gaussian distribution feature-dependent network (BGFD). Specifically, we first introduce the Gaussian noise domain disturbance (GNDD) module, which approximates distribution using image statistical features to characterize domain information, samples noise to perturb the network for learning redundant domain information, addressing domain information differences from a more fundamental perspective. Additionally, within the feature dependency facilitation (FDF) module, we integrate a novel mutual information difference loss ($L_{MI}$) and more sophisticated attention mechanisms to enhance the capabilities of the network, ensuring the acquisition of essential domain information. Subsequently, we have designed a novel detail feature compensation (DFC) module, which compensates for detail feature loss and contamination introduced during the upsampling process from the perspectives of enhancing local features and refining global features. The BGFD has effectively reduced pseudo changes and enhanced the detection capability of detail information. It has also achieved state-of-the-art performance on four publicly available datasets - DSIFN-CD, SYSU-CD, LEVIR-CD, and S2Looking, surpassing baseline models by +8.58%, +1.28%, +0.31%, and +3.76% respectively, in terms of the F1-Score metric.
Abstract:Recommendation models utilizing unique identities (IDs) to represent distinct users and items have dominated the recommender systems literature for over a decade. Since multi-modal content of items (e.g., texts and images) and knowledge graphs (KGs) may reflect the interaction-related users' preferences and items' characteristics, they have been utilized as useful side information to further improve the recommendation quality. However, the success of such methods often limits to either warm-start or strict cold-start item recommendation in which some items neither appear in the training data nor have any interactions in the test stage: (1) Some fail to learn the embedding of a strict cold-start item since side information is only utilized to enhance the warm-start ID representations; (2) The others deteriorate the performance of warm-start recommendation since unrelated multi-modal content or entities in KGs may blur the final representations. In this paper, we propose a unified framework incorporating multi-modal content of items and KGs to effectively solve both strict cold-start and warm-start recommendation termed Firzen, which extracts the user-item collaborative information over frozen heterogeneous graph (collaborative knowledge graph), and exploits the item-item semantic structures and user-user behavioral association over frozen homogeneous graphs (item-item relation graph and user-user co-occurrence graph). Furthermore, we build four unified strict cold-start evaluation benchmarks based on publicly available Amazon datasets and a real-world industrial dataset from Weixin Channels via rearranging the interaction data and constructing KGs. Extensive empirical results demonstrate that our model yields significant improvements for strict cold-start recommendation and outperforms or matches the state-of-the-art performance in the warm-start scenario.
Abstract:Satellite image time series (SITS) data provides continuous observations over time, allowing for the tracking of vegetation changes and growth patterns throughout the seasons and years. Numerous deep learning (DL) approaches using SITS for crop classification have emerged recently, with the latest approaches adopting Transformer for SITS classification. However, the quadratic complexity of self-attention in Transformer poses challenges for classifying long time series. While the cutting-edge Mamba architecture has demonstrated strength in various domains, including remote sensing image interpretation, its capacity to learn temporal representations in SITS data remains unexplored. Moreover, the existing SITS classification methods often depend solely on crop labels as supervision signals, which fails to fully exploit the temporal information. In this paper, we proposed a Satellite Image Time Series Mamba (SITSMamba) method for crop classification based on remote sensing time series data. The proposed SITSMamba contains a spatial encoder based on Convolutional Neural Networks (CNN) and a Mamba-based temporal encoder. To exploit richer temporal information from SITS, we design two branches of decoder used for different tasks. The first branch is a crop Classification Branch (CBranch), which includes a ConvBlock to decode the feature to a crop map. The second branch is a SITS Reconstruction Branch that uses a Linear layer to transform the encoded feature to predict the original input values. Furthermore, we design a Positional Weight (PW) applied to the RBranch to help the model learn rich latent knowledge from SITS. We also design two weighting factors to control the balance of the two branches during training. The code of SITSMamba is available at: https://github.com/XiaoleiQinn/SITSMamba.
Abstract:Privacy protection has always been an ongoing topic, especially for AI. Currently, a low-cost scheme called Machine Unlearning forgets the private data remembered in the model. Specifically, given a private dataset and a trained neural network, we need to use e.g. pruning, fine-tuning, and gradient ascent to remove the influence of the private dataset on the neural network. Inspired by this, we try to use this concept to bridge the gap between the fields of image restoration and security, creating a new research idea. We propose the scene for the All-In-One model (a neural network that restores a wide range of degraded information), where a given dataset such as haze, or rain, is private and needs to be eliminated from the influence of it on the trained model. Notably, we find great challenges in this task to remove the influence of sensitive data while ensuring that the overall model performance remains robust, which is akin to directing a symphony orchestra without specific instruments while keeping the playing soothing. Here we explore a simple but effective approach: Instance-wise Unlearning through the use of adversarial examples and gradient ascent techniques. Our approach is a low-cost solution compared to the strategy of retraining the model from scratch, where the gradient ascent trick forgets the specified data and the performance of the adversarial sample maintenance model is robust. Through extensive experimentation on two popular unified image restoration models, we show that our approach effectively preserves knowledge of remaining data while unlearning a given degradation type.
Abstract:All-in-one image restoration tasks are becoming increasingly important, especially for ultra-high-definition (UHD) images. Existing all-in-one UHD image restoration methods usually boost the model's performance by introducing prompt or customized dynamized networks for different degradation types. For the inference stage, it might be friendly, but in the training stage, since the model encounters multiple degraded images of different quality in an epoch, these cluttered learning objectives might be information pollution for the model. To address this problem, we propose a new training paradigm for general image restoration models, which we name \textbf{Review Learning}, which enables image restoration models to be capable enough to handle multiple types of degradation without prior knowledge and prompts. This approach begins with sequential training of an image restoration model on several degraded datasets, combined with a review mechanism that enhances the image restoration model's memory for several previous classes of degraded datasets. In addition, we design a lightweight all-purpose image restoration network that can efficiently reason about degraded images with 4K ($3840 \times 2160$) resolution on a single consumer-grade GPU.
Abstract:We present REMM, a rotation-equivariant framework for end-to-end multimodal image matching, which fully encodes rotational differences of descriptors in the whole matching pipeline. Previous learning-based methods mainly focus on extracting modal-invariant descriptors, while consistently ignoring the rotational invariance. In this paper, we demonstrate that our REMM is very useful for multimodal image matching, including multimodal feature learning module and cyclic shift module. We first learn modal-invariant features through the multimodal feature learning module. Then, we design the cyclic shift module to rotationally encode the descriptors, greatly improving the performance of rotation-equivariant matching, which makes them robust to any angle. To validate our method, we establish a comprehensive rotation and scale-matching benchmark for evaluating the anti-rotation performance of multimodal images, which contains a combination of multi-angle and multi-scale transformations from four publicly available datasets. Extensive experiments show that our method outperforms existing methods in benchmarking and generalizes well to independent datasets. Additionally, we conducted an in-depth analysis of the key components of the REMM to validate the improvements brought about by the cyclic shift module. Code and dataset at https://github.com/HanNieWHU/REMM.
Abstract:The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which receives little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by Combining these dimensions leads to high human consistency over other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.
Abstract:Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever is used to gather relevant information that is then subsequently provided to a generative model via context augmentation. To address this challenging problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs which require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously-proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.
Abstract:Terahertz (THz) cell-free massive multiple-input-multiple-output (mMIMO) networks have been envisioned as a prospective technology for achieving higher system capacity, improved performance, and ultra-high reliability in 6G networks. However, due to severe attenuation and limited scattering in THz transmission, as well as high power consumption for increased number of access points (APs), further improvement of network capacity becomes challenging. Reconfigurable intelligent surface (RIS) has been introduced as a low-cost solution to reduce AP deployment and assist in data transmission. However, due to the ultra-wide bandwidth and frequency-dependent characteristics of RISs, beam split effect has become an unavoidable obstacle. To compensate the severe performance degradation caused by beam split effect, we introduce additional time delay (TD) layers at both access points (APs) and RISs. Accordingly, we propose a joint precoding framework at APs and RISs to fully unleash the potential of the considered network. Specifically, we first formulate the joint precoding as a non-convex optimization problem. Then, given the location of unchanged RISs, we adjust the time delays (TDs) of APs to align the generated beams towards RISs. After that, with knowledge of the optimal TDs of APs, we decouple the optimization problem into three subproblems of optimizing the baseband beamformers, RISs and TDs of RISs, respectively. Exploiting multidimensional complex quadratic transform, we transform the subproblems into convex forms and solve them under alternate optimizing framework. Numerical results verify that the proposed method can effectively mitigate beam split effect and significantly improve the achievable rate compared with conventional cell-free mMIMO networks.