Abstract:Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
Abstract:High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.
Abstract:Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing state-of-the-art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
Abstract:Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.
Abstract:Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.
Abstract:Cell counting in microscopy images is vital in medicine and biology but extremely tedious and time-consuming to perform manually. While automated methods have advanced in recent years, state-of-the-art approaches tend to increasingly complex model designs. In this paper, we propose a conceptually simple yet effective decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, we show that decoupling these objectives surprisingly improves results. The counter operates on intermediate feature maps rather than pixel space to leverage global context and produce count estimates, while also generating coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditional on the original images and coarse density maps from the counter. Besides, to boost counting accuracy, we further introduce a global message passing module to integrate cross-region patterns. Extensive experiments on four datasets demonstrate that our approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Our key insight is that decoupled learning alleviates the need to learn counting on high-resolution density maps directly, allowing the model to focus on global features critical for accurate estimates. Code is available at https://github.com/MedAITech/DCL.
Abstract:Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at https://github.com/MedAITech/SCRD4AD.
Abstract:Unsupervised anomaly detection using only normal samples is of great significance for quality inspection in industrial manufacturing. Although existing reconstruction-based methods have achieved promising results, they still face two problems: poor distinguishable information in image reconstruction and well abnormal regeneration caused by model over-generalization ability. To overcome the above issues, we convert the image reconstruction into a combination of parallel feature restorations and propose a multi-feature reconstruction network, MFRNet, using crossed-mask restoration in this paper. Specifically, a multi-scale feature aggregator is first developed to generate more discriminative hierarchical representations of the input images from a pre-trained model. Subsequently, a crossed-mask generator is adopted to randomly cover the extracted feature map, followed by a restoration network based on the transformer structure for high-quality repair of the missing regions. Finally, a hybrid loss is equipped to guide model training and anomaly estimation, which gives consideration to both the pixel and structural similarity. Extensive experiments show that our method is highly competitive with or significantly outperforms other state-of-the-arts on four public available datasets and one self-made dataset.
Abstract:This paper studies enhanced dense code multiple access (DCMA) system design for downlink transmission over the Nakagami-$m$ fading channels. By studying the DCMA pairwise error probability (PEP) in a Nakagami-$m$ channel, a novel design metric called minimum logarithmic sum distance (MLSD) is first derived. With respect to the proposed MLSD, we introduce a new family of power-imbalanced dense codebooks by deleting certain rows of a special non-unimodular circulant matrix. Simulation results demonstrate that our proposed dense codebooks lead to both larger minimum Euclidean distance and MLSD, thus yielding significant improvements of error performance over the existing sparse code multiple access and conventional unimodular DCMA schemes in Nakagami-$m$ fading channels under different overloading factors.
Abstract:Owing to the more significant set size properties as compared to the set of complete complementary codes (CCCs), quasi-complementary code sets (QCCSs) are more convenient to support a large number of users in multicarrier code-division multiple-access (MC-CDMA) system over CCCs. Besides set size, it is also desirable to have a low maximum aperiodic correlation magnitude and small alphabet size. This paper aims to construct asymptotically optimal and near-optimal aperiodic QCCSs having a small alphabet size and low maximum correlation magnitude. Using multivariate functions and its associated graph, we propose a family of QCCSs consisting of multiple sets of CCCs and determine the parameters of the proposed QCCSs. Unlike the existing constructions of aperiodic QCCSs, the proposed construction can maintain a small alphabet size irrespective of the increasing sequence length and large set size.