Abstract:Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear whether this adaptation compromises their original language reasoning capabilities. In this work, we explore the effects of multimodal instruction tuning on language reasoning performance. We focus on LLaVA, a leading multimodal framework that integrates LLMs such as Vicuna or Mistral with the CLIP vision encoder. We compare the performance of the original LLMs with their multimodal-adapted counterparts across eight language reasoning tasks. Our experiments yield several key insights. First, the impact of multimodal learning varies between Vicuna and Mistral: we observe a degradation in language reasoning for Mistral but improvements for Vicuna across most tasks. Second, while multimodal instruction learning consistently degrades performance on mathematical reasoning tasks (e.g., GSM8K), it enhances performance on commonsense reasoning tasks (e.g., CommonsenseQA). Finally, we demonstrate that a training-free model merging technique can effectively mitigate the language reasoning degradation observed in multimodal-adapted Mistral and even improve performance on visual tasks.
Abstract:Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates \textbf{T}rusted learning with \textbf{S}parsification to boost both generalization capability and computational efficiency in UHD image restoration. The key is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-$p$ with random matrix theory to quantify the uncertainty of tokens, thereby improving the robustness of the model. Our model can run a 4K image in real time (40fps) with 3.38 M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.
Abstract:Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.
Abstract:Change Detection (CD) enables the identification of alterations between images of the same area captured at different times. However, existing CD methods still struggle to address pseudo changes resulting from domain information differences in multi-temporal images and instances of detail errors caused by the loss and contamination of detail features during the upsampling process in the network. To address this, we propose a bi-temporal Gaussian distribution feature-dependent network (BGFD). Specifically, we first introduce the Gaussian noise domain disturbance (GNDD) module, which approximates distribution using image statistical features to characterize domain information, samples noise to perturb the network for learning redundant domain information, addressing domain information differences from a more fundamental perspective. Additionally, within the feature dependency facilitation (FDF) module, we integrate a novel mutual information difference loss ($L_{MI}$) and more sophisticated attention mechanisms to enhance the capabilities of the network, ensuring the acquisition of essential domain information. Subsequently, we have designed a novel detail feature compensation (DFC) module, which compensates for detail feature loss and contamination introduced during the upsampling process from the perspectives of enhancing local features and refining global features. The BGFD has effectively reduced pseudo changes and enhanced the detection capability of detail information. It has also achieved state-of-the-art performance on four publicly available datasets - DSIFN-CD, SYSU-CD, LEVIR-CD, and S2Looking, surpassing baseline models by +8.58%, +1.28%, +0.31%, and +3.76% respectively, in terms of the F1-Score metric.
Abstract:Recommendation models utilizing unique identities (IDs) to represent distinct users and items have dominated the recommender systems literature for over a decade. Since multi-modal content of items (e.g., texts and images) and knowledge graphs (KGs) may reflect the interaction-related users' preferences and items' characteristics, they have been utilized as useful side information to further improve the recommendation quality. However, the success of such methods often limits to either warm-start or strict cold-start item recommendation in which some items neither appear in the training data nor have any interactions in the test stage: (1) Some fail to learn the embedding of a strict cold-start item since side information is only utilized to enhance the warm-start ID representations; (2) The others deteriorate the performance of warm-start recommendation since unrelated multi-modal content or entities in KGs may blur the final representations. In this paper, we propose a unified framework incorporating multi-modal content of items and KGs to effectively solve both strict cold-start and warm-start recommendation termed Firzen, which extracts the user-item collaborative information over frozen heterogeneous graph (collaborative knowledge graph), and exploits the item-item semantic structures and user-user behavioral association over frozen homogeneous graphs (item-item relation graph and user-user co-occurrence graph). Furthermore, we build four unified strict cold-start evaluation benchmarks based on publicly available Amazon datasets and a real-world industrial dataset from Weixin Channels via rearranging the interaction data and constructing KGs. Extensive empirical results demonstrate that our model yields significant improvements for strict cold-start recommendation and outperforms or matches the state-of-the-art performance in the warm-start scenario.
Abstract:Satellite image time series (SITS) data provides continuous observations over time, allowing for the tracking of vegetation changes and growth patterns throughout the seasons and years. Numerous deep learning (DL) approaches using SITS for crop classification have emerged recently, with the latest approaches adopting Transformer for SITS classification. However, the quadratic complexity of self-attention in Transformer poses challenges for classifying long time series. While the cutting-edge Mamba architecture has demonstrated strength in various domains, including remote sensing image interpretation, its capacity to learn temporal representations in SITS data remains unexplored. Moreover, the existing SITS classification methods often depend solely on crop labels as supervision signals, which fails to fully exploit the temporal information. In this paper, we proposed a Satellite Image Time Series Mamba (SITSMamba) method for crop classification based on remote sensing time series data. The proposed SITSMamba contains a spatial encoder based on Convolutional Neural Networks (CNN) and a Mamba-based temporal encoder. To exploit richer temporal information from SITS, we design two branches of decoder used for different tasks. The first branch is a crop Classification Branch (CBranch), which includes a ConvBlock to decode the feature to a crop map. The second branch is a SITS Reconstruction Branch that uses a Linear layer to transform the encoded feature to predict the original input values. Furthermore, we design a Positional Weight (PW) applied to the RBranch to help the model learn rich latent knowledge from SITS. We also design two weighting factors to control the balance of the two branches during training. The code of SITSMamba is available at: https://github.com/XiaoleiQinn/SITSMamba.
Abstract:Privacy protection has always been an ongoing topic, especially for AI. Currently, a low-cost scheme called Machine Unlearning forgets the private data remembered in the model. Specifically, given a private dataset and a trained neural network, we need to use e.g. pruning, fine-tuning, and gradient ascent to remove the influence of the private dataset on the neural network. Inspired by this, we try to use this concept to bridge the gap between the fields of image restoration and security, creating a new research idea. We propose the scene for the All-In-One model (a neural network that restores a wide range of degraded information), where a given dataset such as haze, or rain, is private and needs to be eliminated from the influence of it on the trained model. Notably, we find great challenges in this task to remove the influence of sensitive data while ensuring that the overall model performance remains robust, which is akin to directing a symphony orchestra without specific instruments while keeping the playing soothing. Here we explore a simple but effective approach: Instance-wise Unlearning through the use of adversarial examples and gradient ascent techniques. Our approach is a low-cost solution compared to the strategy of retraining the model from scratch, where the gradient ascent trick forgets the specified data and the performance of the adversarial sample maintenance model is robust. Through extensive experimentation on two popular unified image restoration models, we show that our approach effectively preserves knowledge of remaining data while unlearning a given degradation type.
Abstract:All-in-one image restoration tasks are becoming increasingly important, especially for ultra-high-definition (UHD) images. Existing all-in-one UHD image restoration methods usually boost the model's performance by introducing prompt or customized dynamized networks for different degradation types. For the inference stage, it might be friendly, but in the training stage, since the model encounters multiple degraded images of different quality in an epoch, these cluttered learning objectives might be information pollution for the model. To address this problem, we propose a new training paradigm for general image restoration models, which we name \textbf{Review Learning}, which enables image restoration models to be capable enough to handle multiple types of degradation without prior knowledge and prompts. This approach begins with sequential training of an image restoration model on several degraded datasets, combined with a review mechanism that enhances the image restoration model's memory for several previous classes of degraded datasets. In addition, we design a lightweight all-purpose image restoration network that can efficiently reason about degraded images with 4K ($3840 \times 2160$) resolution on a single consumer-grade GPU.
Abstract:We present REMM, a rotation-equivariant framework for end-to-end multimodal image matching, which fully encodes rotational differences of descriptors in the whole matching pipeline. Previous learning-based methods mainly focus on extracting modal-invariant descriptors, while consistently ignoring the rotational invariance. In this paper, we demonstrate that our REMM is very useful for multimodal image matching, including multimodal feature learning module and cyclic shift module. We first learn modal-invariant features through the multimodal feature learning module. Then, we design the cyclic shift module to rotationally encode the descriptors, greatly improving the performance of rotation-equivariant matching, which makes them robust to any angle. To validate our method, we establish a comprehensive rotation and scale-matching benchmark for evaluating the anti-rotation performance of multimodal images, which contains a combination of multi-angle and multi-scale transformations from four publicly available datasets. Extensive experiments show that our method outperforms existing methods in benchmarking and generalizes well to independent datasets. Additionally, we conducted an in-depth analysis of the key components of the REMM to validate the improvements brought about by the cyclic shift module. Code and dataset at https://github.com/HanNieWHU/REMM.
Abstract:The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which receives little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by Combining these dimensions leads to high human consistency over other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.