Abstract:Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve {the} alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
Abstract:Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm enhances domain-specific recommendation tasks with parameter-efficient fine-tuning techniques, in order to improve models under the warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages, and combining them together would be helpful. To that end, in this paper, we propose a generalizable and efficient LLM-based recommendation framework MoLoRec. Our approach starts by parameter-efficient fine-tuning a domain-general module with general recommendation instruction data, to align LLM with recommendation knowledge. Then, given users' behavior of a specific domain, we construct a domain-specific instruction dataset and apply efficient fine-tuning to the pre-trained LLM. After that, we provide approaches to integrate the above domain-general part and domain-specific part with parameters mixture. Please note that, MoLoRec is efficient with plug and play, as the domain-general module is trained only once, and any domain-specific plug-in can be efficiently merged with only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed MoLoRec.
Abstract:Empowered by semantic-rich content information, multimedia recommendation has emerged as a potent personalized technique. Current endeavors center around harnessing multimedia content to refine item representation or uncovering latent item-item structures based on modality similarity. Despite the effectiveness, we posit that these methods are usually suboptimal due to the introduction of irrelevant multimedia features into recommendation tasks. This stems from the fact that generic multimedia feature extractors, while well-designed for domain-specific tasks, can inadvertently introduce task-irrelevant features, leading to potential misguidance of recommenders. In this work, we propose a denoised multimedia recommendation paradigm via the Information Bottleneck principle (IB). Specifically, we propose a novel Information Bottleneck denoised Multimedia Recommendation (IBMRec) model to tackle the irrelevant feature issue. IBMRec removes task-irrelevant features from both feature and item-item structure perspectives, which are implemented by two-level IB learning modules: feature-level (FIB) and graph-level (GIB). In particular, FIB focuses on learning the minimal yet sufficient multimedia features. This is achieved by maximizing the mutual information between multimedia representation and recommendation tasks, while concurrently minimizing it between multimedia representation and pre-trained multimedia features. Furthermore, GIB is designed to learn the robust item-item graph structure, it refines the item-item graph based on preference affinity, then minimizes the mutual information between the original graph and the refined one. Extensive experiments across three benchmarks validate the effectiveness of our proposed model, showcasing high performance, and applicability to various multimedia recommenders.
Abstract:Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities within extensive texts, serving as a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations spanning multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and lack the integration of external knowledge, limiting contextual richness. Besides, the scarcity of annotated data further hampers model training. Recent advancements in large language models (LLMs) have inspired us to explore all the above issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework via LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the Iteration-of-REsummary (IoRs) prompt for solving the data scarcity issue. In this way, Bio-RE task-specific synthetic data can be generated by guiding ChatGPT to focus on entity relations and iteratively refining synthetic data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, during the inference, a biomedical-specific RAG approach, named CUI RAG, is designed to leverage CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments conducted on three Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art performance of our proposed method by comparing it with other related works.
Abstract:Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Due to the cross-modal semantic gap and the lack of word-action correspondence labels for strong supervision alignment, the SLP suffers huge challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through Cross-modal Semantic Aligner (CSA) and Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order consistency of fine-grained sign glosses and actions). As for MSC, we construct multimodal triplets based on paired and unpaired samples in batch data. By pulling closer the corresponding text-visual pairs and pushing apart the non-corresponding text-visual pairs, we constrain the semantic co-occurrence degree between corresponding gloss and pose sequences (i.e., the semantic consistency of coarse-grained textual sentences and sign videos). Extensive experiments on the popular PHOENIX14T benchmark show that the LVMCN outperforms the state-of-the-art.
Abstract:Sign Language Production (SLP) aims to generate semantically consistent sign videos from textual statements, where the conversion from textual glosses to sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign poses as discrete three-dimensional coordinates and directly fit them, which overlooks the relative positional relationships among joints. To this end, we provide a new perspective, constraining joint associations and gesture details by modeling the limb bones to improve the accuracy and naturalness of the generated poses. In this work, we propose a pioneering iconicity disentangled diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap between relative positions among joints. The ID module disentangles the conventional 3D joint representation into a 4D bone representation, comprising the 3D spatial direction vector and 1D spatial distance vector between adjacent joints. Additionally, an Attribute Controllable Diffusion (ACD) module is introduced to further constrain joint associations, in which the attribute separation layer aims to separate the bone direction and length attributes, and the attribute control layer is designed to guide the pose generation by leveraging the above attributes. The ACD module utilizes the gloss embeddings as semantic conditions and finally generates sign poses from noise embeddings. Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the effectiveness of our method. The code is available at: https://github.com/NaVi-start/Sign-IDD.
Abstract:Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: \href{https://github.com/hfutml/Calibration-MLLM}{https://github.com/hfutml/Calibration-MLLM}.
Abstract:Score-based Generative Models (SGMs) have demonstrated remarkable generalization abilities, e.g. generating unseen, but natural data. However, the greater the generalization power, the more likely the unintended generalization, and the more dangerous the abuse. Research on moderated generalization in SGMs remains limited. To fill this gap, we first examine the current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model after removing the undesirable training data, and find it does not work in SGMs. Further analysis of score functions reveals that the MU 'gold standard' does not alter the original score function, which explains its ineffectiveness. Based on this insight, we propose the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that redirects the score function away from undesirable data during the continuous-time stochastic differential equation process. Extensive experimental results demonstrate that MSGM significantly reduces the likelihood of generating undesirable content while preserving high visual quality for normal image generation. Albeit designed for SGMs, MSGM is a general and flexible MU framework that is compatible with diverse diffusion architectures (SGM and DDPM) and training strategies (re-training and fine-tuning), and enables zero-shot transfer of the pre-trained models to downstream tasks, e.g. image inpainting and reconstruction. The code will be shared upon acceptance.
Abstract:Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
Abstract:Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable to be deployed in resource-constrained mobile devices, due to suffering from either few efficiency gains or significant accuracy drops. In this paper, we propose a new de\textbf{C}oupled du\textbf{A}l-interactive linea\textbf{R} att\textbf{E}ntion (CARE) mechanism, revealing that features' decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing the efficiency of models. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving $78.4/82.1\%$ top-1 accuracy on ImagegNet-1K at the cost of only $0.7/1.9$ GMACs. Codes will be released on \href{..}{github}.