Abstract:Diffeomorphic image registration is crucial for various medical imaging applications because it can preserve the topology of the transformation. This study introduces DCCNN-LSTM-Reg, a learning framework that evolves dynamically and learns a symmetrical registration path by satisfying a specified control increment system. This framework aims to obtain symmetric diffeomorphic deformations between moving and fixed images. To achieve this, we combine deep learning networks with diffeomorphic mathematical mechanisms to create a continuous and dynamic registration architecture, which consists of multiple Symmetric Registration (SR) modules cascaded on five different scales. Specifically, our method first uses two U-nets with shared parameters to extract multiscale feature pyramids from the images. We then develop an SR-module comprising a sequential CNN-LSTM architecture to progressively correct the forward and reverse multiscale deformation fields using control increment learning and the homotopy continuation technique. Through extensive experiments on three 3D registration tasks, we demonstrate that our method outperforms existing approaches in both quantitative and qualitative evaluations.
Abstract:Purpose: To perform image registration and averaging of multiple free-breathing single-shot cardiac images, where the individual images may have a low signal-to-noise ratio (SNR). Methods: To address low SNR encountered in single-shot imaging, especially at low field strengths, we propose a fast deep learning (DL)-based image registration method, called Averaging Morph with Edge Detection (AiM-ED). AiM-ED jointly registers multiple noisy source images to a noisy target image and utilizes a noise-robust pre-trained edge detector to define the training loss. We validate AiM-ED using synthetic late gadolinium enhanced (LGE) imaging data from the MR extended cardiac-torso (MRXCAT) phantom and retrospectively undersampled single-shot data from healthy subjects (24 slices) and patients (5 slices) under various levels of added noise. Additionally, we demonstrate the clinical feasibility of AiM-ED by applying it to prospectively undersampled data from patients (6 slices) scanned at a 0.55T scanner. Results: Compared to a traditional energy-minimization-based image registration method and DL-based VoxelMorph, images registered using AiM-ED exhibit higher values of recovery SNR and three perceptual image quality metrics. An ablation study shows the benefit of both jointly processing multiple source images and using an edge map in AiM-ED. Conclusion: For single-shot LGE imaging, AiM-ED outperforms existing image registration methods in terms of image quality. With fast inference, minimal training data requirements, and robust performance at various noise levels, AiM-ED has the potential to benefit single-shot CMR applications.
Abstract:Large-scale "pre-train and prompt learning" paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computational costs while enhancing model applicability across various tasks. Graphs, as versatile data structures that capture relationships between entities, play pivotal roles in fields such as social network analysis, recommender systems, and biological graphs. Despite the success of pre-train and prompt learning paradigms in Natural Language Processing (NLP) and Computer Vision (CV), their application in graph domains remains nascent. In graph-structured data, not only do the node and edge features often have disparate distributions, but the topological structures also differ significantly. This diversity in graph data can lead to incompatible patterns or gaps between pre-training and fine-tuning on downstream graphs. We aim to bridge this gap by summarizing methods for alleviating these disparities. This includes exploring prompt design methodologies, comparing related techniques, assessing application scenarios and datasets, and identifying unresolved problems and challenges. This survey categorizes over 100 relevant works in this field, summarizing general design principles and the latest applications, including text-attributed graphs, molecules, proteins, and recommendation systems. Through this extensive review, we provide a foundational understanding of graph prompt learning, aiming to impact not only the graph mining community but also the broader Artificial General Intelligence (AGI) community.
Abstract:In the information retrieval (IR) area, dense retrieval (DR) models use deep learning techniques to encode queries and passages into embedding space to compute their semantic relations. It is important for DR models to balance both efficiency and effectiveness. Pre-trained language models (PLMs), especially Transformer-based PLMs, have been proven to be effective encoders of DR models. However, the self-attention component in Transformer-based PLM results in a computational complexity that grows quadratically with sequence length, and thus exhibits a slow inference speed for long-text retrieval. Some recently proposed non-Transformer PLMs, especially the Mamba architecture PLMs, have demonstrated not only comparable effectiveness to Transformer-based PLMs on generative language tasks but also better efficiency due to linear time scaling in sequence length. This paper implements the Mamba Retriever to explore whether Mamba can serve as an effective and efficient encoder of DR model for IR tasks. We fine-tune the Mamba Retriever on the classic short-text MS MARCO passage ranking dataset and the long-text LoCoV0 dataset. Experimental results show that (1) on the MS MARCO passage ranking dataset and BEIR, the Mamba Retriever achieves comparable or better effectiveness compared to Transformer-based retrieval models, and the effectiveness grows with the size of the Mamba model; (2) on the long-text LoCoV0 dataset, the Mamba Retriever can extend to longer text length than its pre-trained length after fine-tuning on retrieval task, and it has comparable or better effectiveness compared to other long-text retrieval models; (3) the Mamba Retriever has superior inference speed for long-text retrieval. In conclusion, Mamba Retriever is both effective and efficient, making it a practical model, especially for long-text retrieval.
Abstract:Recently, with the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged due to their robust vision-understanding capabilities. However, training these models requires vast amounts of data, posing challenges to efficiency, effectiveness, data quality, and privacy. In this paper, we introduce SynthVLM, a novel data synthesis pipeline for VLLMs. Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs. Leveraging these pairs, we achieve state-of-the-art (SoTA) performance on various vision question answering tasks, maintaining high alignment quality and preserving advanced language abilities. Moreover, SynthVLM surpasses traditional GPT-4 Vision-based caption generation methods in performance while significantly reducing computational overhead. Crucially, our method's reliance on purely generated data ensures the preservation of privacy, achieving SoTA performance with just 100k data points (only 18% of the official dataset size).
Abstract:Graph classification is a critical task in numerous multimedia applications, where graphs are employed to represent diverse types of multimedia data, including images, videos, and social networks. Nevertheless, in real-world scenarios, labeled graph data can be limited or scarce. To address this issue, we focus on the problem of semi-supervised graph classification, which involves both supervised and unsupervised models learning from labeled and unlabeled data. In contrast to recent approaches that transfer the entire knowledge from the unsupervised model to the supervised one, we argue that an effective transfer should only retain the relevant semantics that align well with the supervised task. In this paper, we propose a novel framework named DisenSemi, which learns disentangled representation for semi-supervised graph classification. Specifically, a disentangled graph encoder is proposed to generate factor-wise graph representations for both supervised and unsupervised models. Then we train two models via supervised objective and mutual information (MI)-based constraints respectively. To ensure the meaningful transfer of knowledge from the unsupervised encoder to the supervised one, we further define an MI-based disentangled consistency regularization between two models and identify the corresponding rationale that aligns well with the current graph classification task. Experimental results on a range of publicly accessible datasets reveal the effectiveness of our DisenSemi.
Abstract:Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
Abstract:Adapting Large Language Models (LLMs) for recommendation requires careful consideration of the decoding process, given the inherent differences between generating items and natural language. Existing approaches often directly apply LLMs' original decoding methods. However, we find these methods encounter significant challenges: 1) amplification bias -- where standard length normalization inflates scores for items containing tokens with generation probabilities close to 1 (termed ghost tokens), and 2) homogeneity issue -- generating multiple similar or repetitive items for a user. To tackle these challenges, we introduce a new decoding approach named Debiasing-Diversifying Decoding (D3). D3 disables length normalization for ghost tokens to alleviate amplification bias, and it incorporates a text-free assistant model to encourage tokens less frequently generated by LLMs for counteracting recommendation homogeneity. Extensive experiments on real-world datasets demonstrate the method's effectiveness in enhancing accuracy and diversity.
Abstract:Nested Named Entity Recognition (NNER) focuses on addressing overlapped entity recognition. Compared to Flat Named Entity Recognition (FNER), annotated resources are scarce in the corpus for NNER. Data augmentation is an effective approach to address the insufficient annotated corpus. However, there is a significant lack of exploration in data augmentation methods for NNER. Due to the presence of nested entities in NNER, existing data augmentation methods cannot be directly applied to NNER tasks. Therefore, in this work, we focus on data augmentation for NNER and resort to more expressive structures, Composited-Nested-Label Classification (CNLC) in which constituents are combined by nested-word and nested-label, to model nested entities. The dataset is augmented using the Composited-Nested-Learning (CNL). In addition, we propose the Confidence Filtering Mechanism (CFM) for a more efficient selection of generated data. Experimental results demonstrate that this approach results in improvements in ACE2004 and ACE2005 and alleviates the impact of sample imbalance.
Abstract:Background: Accelerated real-time cine (RT-Cine) imaging enables cardiac function assessment without the need for breath-holding. However, when performed during in-magnet exercise, RT-Cine images may exhibit significant motion artifacts. Methods: By projecting the time-averaged images to the subspace spanned by the coil sensitivity maps, we propose a coil reweighting (CR) method to automatically suppress a subset of receive coils that introduces a high level of artifacts in the reconstructed image. RT-Cine data collected at rest and during exercise from ten healthy volunteers and six patients were utilized to assess the performance of the proposed method. One short-axis and one two-chamber RT-Cine series reconstructed with and without CR from each subject were visually scored by two cardiologists in terms of the level of artifacts on a scale of 1 (worst) to 5 (best). Results: For healthy volunteers, applying CR to RT-Cine images collected at rest did not significantly change the image quality score (p=1). In contrast, for RT-Cine images collected during exercise, CR significantly improved the score from 3.9 to 4.68 (p<0.001). Similarly, in patients, CR did not significantly change the score for images collected at rest (p=0.031) but markedly improved the score from 3.15 to 4.42 (p<0.001) for images taken during exercise. Despite lower image quality scores in the patient cohort compared to healthy subjects, likely due to larger body habitus and the difficulty of limiting body motion during exercise, CR effectively suppressed motion artifacts, with all image series from the patient cohort receiving a score of four or higher. Conclusion: Using data from healthy subjects and patients, we demonstrate that the motion artifacts in the reconstructed RT-Cine images can be effectively suppressed significantly with the proposed CR method.