Abstract:Diffeomorphic image registration is crucial for various medical imaging applications because it can preserve the topology of the transformation. This study introduces DCCNN-LSTM-Reg, a learning framework that evolves dynamically and learns a symmetrical registration path by satisfying a specified control increment system. This framework aims to obtain symmetric diffeomorphic deformations between moving and fixed images. To achieve this, we combine deep learning networks with diffeomorphic mathematical mechanisms to create a continuous and dynamic registration architecture, which consists of multiple Symmetric Registration (SR) modules cascaded on five different scales. Specifically, our method first uses two U-nets with shared parameters to extract multiscale feature pyramids from the images. We then develop an SR-module comprising a sequential CNN-LSTM architecture to progressively correct the forward and reverse multiscale deformation fields using control increment learning and the homotopy continuation technique. Through extensive experiments on three 3D registration tasks, we demonstrate that our method outperforms existing approaches in both quantitative and qualitative evaluations.
Abstract:Deep-learning methods have shown promising performance for low-dose computed tomography (LDCT) reconstruction. However, supervised methods suffer from a lack of labeled data in clinical scenarios, and CNN-based unsupervised denoising methods tend to over-smooth the reconstructed image. Recently, methods based on normalizing flows (NFs) have shown advantages in producing detail-rich images and avoiding over-smoothing. Two issues remain, however: (1) although alternating optimization in the data and latent spaces makes good use of the regularization and generation capabilities of NFs, the current two-way transformation strategy between noisy images and latent variables causes detail loss and secondary artifacts; and (2) training NFs on high-resolution CT images is difficult because of the heavy computational cost. Although using conditional normalizing flows (CNFs) to learn a conditional probability can reduce the computational burden, current methods require labeled data for conditionalization, and unsupervised CNFs-based LDCT reconstruction remains an open problem. To tackle these problems, we propose a novel CNFs-based unsupervised LDCT iterative reconstruction algorithm. It employs a strict one-way transformation when performing alternating optimization in the dual spaces, thus effectively avoiding detail loss and secondary artifacts. By proposing a novel unsupervised conditionalization strategy, we train CNFs on high-resolution CT images, thus achieving fast and high-quality unsupervised reconstruction. Experiments on different datasets suggest that the proposed algorithm can surpass some state-of-the-art unsupervised and even supervised methods.
Abstract:Schema and entity matching tasks are crucial for data integration and management. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. In this paper, we present the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a pseudo-code-based task decomposition strategy to adopt task-specific natural language statements that guide LLM reasoning and reduce confusion. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Additionally, we introduce a result-ensembling strategy to leverage multiple knowledge sources and suppress poorly formatted outputs. Comprehensive evaluations on schema and entity matching tasks demonstrate that KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by an average F1 score of 22.9% and competes effectively with SOTA fine-tuned LLMs. Moreover, KcMF generalizes well across different LLMs.
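The result-ensembling idea can be illustrated with a minimal sketch. The abstract does not specify the ensembling rule, so this assumes a plain majority vote over yes/no answers in which poorly formatted outputs are discarded; `parse_answer`, `ensemble`, and the `default` fallback are hypothetical names, not part of KcMF.

```python
from collections import Counter

def parse_answer(raw):
    """Map a raw model output to 'yes'/'no', or None if it is poorly formatted."""
    text = raw.strip().lower().rstrip(".")
    return text if text in ("yes", "no") else None

def ensemble(raw_outputs, default="no"):
    """Majority vote over well-formed answers (assumed rule, not the paper's).

    Outputs that fail to parse are dropped; if nothing parses, return default.
    """
    votes = Counter(a for a in map(parse_answer, raw_outputs) if a is not None)
    if not votes:
        return default
    return votes.most_common(1)[0][0]

# One output per knowledge source; the third is malformed and is ignored.
outputs = ["Yes.", "yes", "I think they might match?", "no"]
decision = ensemble(outputs)
```

Here the malformed third output does not contribute a vote, so the decision rests on the three answers that parse.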
Abstract:Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds based on the modeled content. Masked autoencoders (MAE) have become the mainstream paradigm in point cloud self-supervised learning. However, existing MAE-based methods are domain-specific, which limits the model's generalization. In this paper, we propose to pre-train a general Point cloud Hybrid-Domain Masked AutoEncoder (PointHDMAE) via a block-to-scene pre-training strategy. We first propose a hybrid-domain masked autoencoder consisting of an encoder and decoder belonging to the scene domain and object domain, respectively. The object domain encoder specializes in handling object point clouds, and multiple shared object encoders assist the scene domain encoder in analyzing scene point clouds. Furthermore, we propose a block-to-scene strategy to pre-train our hybrid-domain model. Specifically, we first randomly select point blocks within a scene and apply a set of transformations to convert each point block's coordinates from the scene space to the object space. Then, we employ an object-level mask and reconstruction pipeline to recover the masked points of each block, enabling the object encoder to learn a universal object representation. Finally, we introduce a scene-level block position regression pipeline, which uses the blocks' features in the object space to regress the blocks' initial positions within the scene space, facilitating the learning of scene representations. Extensive experiments across different datasets and tasks demonstrate the generalization and superiority of our hybrid-domain model.
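As a rough illustration of moving a point block from scene space to object space, the sketch below centers a block at the origin and scales it into the unit sphere. The abstract only mentions "a set of transformations", so this particular normalization, the function name `to_object_space`, and the sample coordinates are assumptions made for illustration.

```python
import math

def to_object_space(block):
    """Center a block at the origin and scale it into the unit sphere.

    Returns (normalized_points, center, scale). The center is the block's
    original position in the scene, which a position-regression head could
    plausibly use as its target (an assumption, not the paper's definition).
    """
    n = len(block)
    center = tuple(sum(p[i] for p in block) / n for i in range(3))
    centered = [tuple(c - m for c, m in zip(p, center)) for p in block]
    # Largest distance from the origin; fall back to 1.0 for a degenerate block.
    scale = max(math.sqrt(sum(c * c for c in p)) for p in centered) or 1.0
    normalized = [tuple(c / scale for c in p) for p in centered]
    return normalized, center, scale

# A hypothetical four-point block sampled from a scene.
block = [(10.0, 4.0, 2.0), (12.0, 4.0, 2.0), (11.0, 6.0, 2.0), (11.0, 2.0, 2.0)]
points, center, scale = to_object_space(block)
```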
Abstract:We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
Abstract:Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned, and noise-free. Therefore, representations of those ideal synthetic point clouds have limited geometric variation and can achieve good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain-invariant geometric patterns from incomplete and noisy point clouds. To address this problem, we introduce a novel scheme for inducing geometric invariance of point cloud representations across domains by regularizing representation learning with two self-supervised geometric augmentation tasks. On the one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate the centroid shift of point clouds caused by occlusion and noise. On the other hand, we pioneer the integration of relational self-supervised learning on geometrically augmented point clouds in a cascaded manner, using the intrinsic relationship between augmented variants and other samples as extra constraints on cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
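The translation-distance pretext task can be sketched as follows, under stated assumptions: each augmented sample is the original cloud shifted by one random offset, and the regression target is the Euclidean length of that offset. The names `translate_with_label` and `centroid` and the `max_shift` range are hypothetical; the abstract does not give the exact formulation.

```python
import math
import random

def translate_with_label(points, max_shift=0.5, rng=random):
    """Shift every point by one random offset; return (shifted_points, distance).

    The distance is the self-supervised label a network would be asked to predict.
    """
    offset = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    shifted = [tuple(c + o for c, o in zip(p, offset)) for p in points]
    distance = math.sqrt(sum(o * o for o in offset))
    return shifted, distance

def centroid(points):
    """Mean position of a point cloud."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
shifted, target = translate_with_label(cloud, rng=random.Random(0))
```

Because a rigid translation moves the centroid by exactly the offset, the label equals the centroid shift, which is the quantity the pretext task is meant to make the representation sensitive to.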
Abstract:In the field of Alzheimer's disease diagnosis, segmentation and classification tasks are inherently interconnected. Sharing knowledge between models for these tasks can significantly improve training efficiency, particularly when training data is scarce. However, traditional knowledge distillation techniques often struggle to bridge the gap between segmentation and classification due to the distinct nature of tasks and different model architectures. To address this challenge, we propose a dual-stream pipeline that facilitates cross-task and cross-architecture knowledge sharing. Our approach introduces a dual-stream embedding module that unifies feature representations from segmentation and classification models, enabling dimensional integration of these features to guide the classification model. We validated our method on multiple 3D datasets for Alzheimer's disease diagnosis, demonstrating significant improvements in classification performance, especially on small datasets. Furthermore, we extended our pipeline with a residual temporal attention mechanism for early diagnosis, utilizing images taken before the atrophy of patients' brain mass. This advancement shows promise in enabling diagnosis approximately six months earlier in mild and asymptomatic stages, offering critical time for intervention.
Abstract:Deep learning (DL) models have shown significant potential in Alzheimer's Disease (AD) classification. However, understanding and interpreting these models remains challenging, which hinders their adoption in clinical practice. Techniques such as saliency maps have proven effective in providing visual and empirical clues about how these models work, but a gap remains in understanding which specific brain regions DL models focus on and whether these regions are pathologically associated with AD. To bridge this gap, we developed a quantitative disease-focusing strategy: we first enhance the interpretability of DL models using saliency maps and brain segmentations, and then propose a disease-focus (DF) score that quantifies how much a DL model focuses on brain areas relevant to AD pathology, based on clinically known MRI-based pathological regions of AD. Using this strategy, we compared several state-of-the-art DL models for classifying patients with AD vs. cognitively normal individuals from MRI data, including a baseline 3D ResNet model, a pretrained MedicalNet model, and a MedicalNet with data augmentation; we then evaluated these models in terms of their ability to focus on disease-relevant regions. Our results show distinct disease-focusing patterns across models, particularly characteristic patterns for the pretrained models and data augmentation, and also provide insight into their classification performance. These results suggest that our approach for quantitatively assessing the ability of DL models to focus on disease-relevant regions may help improve the interpretability of these models for AD classification and facilitate their adoption for AD diagnosis in clinical practice. The code is publicly available at https://github.com/Liang-lt/ADNI.
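The abstract does not define the DF score, so the following is only one plausible form, sketched for illustration: the share of total absolute saliency that falls inside voxels whose segmentation label belongs to a set of disease-relevant regions. The function name `focus_fraction`, the flat-list data layout, and the region names are all hypothetical.

```python
def focus_fraction(saliency, labels, relevant_regions):
    """Share of total absolute saliency inside voxels labelled as relevant.

    saliency: per-voxel saliency values (flattened).
    labels:   per-voxel region names from a brain segmentation (same length).
    This is an assumed stand-in, not the paper's DF score.
    """
    total = sum(abs(s) for s in saliency)
    if total == 0:
        return 0.0
    inside = sum(abs(s) for s, r in zip(saliency, labels) if r in relevant_regions)
    return inside / total

# Four toy voxels; region names are illustrative only.
saliency = [0.1, 0.4, 0.3, 0.2]
labels = ["cortex", "hippocampus", "hippocampus", "ventricle"]
score = focus_fraction(saliency, labels, {"hippocampus"})
```

A value near 1 would mean the model's saliency is concentrated in the chosen regions, and a value near 0 would mean it is concentrated elsewhere.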
Abstract:Diffusion models have shown promising results in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music generation models typically focus on generating music by capturing global musical attributes such as genre and mood. However, music composition is a complex, multilayered task that often involves musical arrangement as an integral part of the process. Arrangement involves composing each instrument to align with existing ones in terms of beat, dynamics, harmony, and melody, which requires greater precision and control over tracks than text prompts usually provide. In this work, we address these challenges by extending MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model can generate music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, in which it generates any subset of tracks given the others (e.g., generating a piano track that complements given bass and drum tracks). We compared our model with an existing multi-track generative model and demonstrated that it achieves considerable improvements across objective metrics for both total and arrangement generation tasks.
Abstract:Achieving robust speech separation for overlapping speakers in various acoustic environments with noise and reverberation remains an open challenge. Although existing datasets are available to train separators for specific scenarios, models trained on them do not generalize effectively across diverse real-world scenarios. In this paper, we present a novel data simulation pipeline that produces diverse training data from a range of acoustic environments and content, and propose new training paradigms to improve the quality of a general speech separation model. Specifically, we first introduce AC-SIM, a data simulation pipeline that incorporates broad variations in both content and acoustics. We then integrate multiple training objectives into permutation invariant training (PIT) to enhance the separation quality and generalization of the trained model. Finally, we conduct comprehensive objective and human listening experiments across separation architectures and benchmarks to validate our methods, demonstrating substantial improvements in generalization on both non-homologous and real-world test sets.
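For readers unfamiliar with permutation invariant training, a minimal sketch of the core idea follows: the loss is evaluated under every assignment of estimated sources to reference speakers, and the lowest value is kept. Mean squared error is used here purely as a stand-in objective; the abstract says multiple objectives are integrated into PIT but does not list them, and `mse` and `pit_loss` are hypothetical names.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length sequences of samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references):
    """Return (loss, perm) with the lowest mean pairwise MSE over all assignments.

    perm[i] is the index of the reference matched to estimate i.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

refs = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
ests = [[0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]  # same sources, output order swapped
loss, perm = pit_loss(ests, refs)
```

The exhaustive search costs n! loss evaluations, which is acceptable for the two or three speakers typical of separation benchmarks.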