Abstract: Although complex spectrum-based speech enhancement (SE) methods have achieved strong performance, coupling amplitude and phase can lead to a compensation effect, in which amplitude information is sacrificed to compensate for phase errors, which is harmful to SE. In addition, to further improve SE performance, many modules are stacked onto SE models, and the resulting complexity limits practical application. To address these problems, we propose a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages the structured complex spectrum to implicitly capture phase information and avoids the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary components and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we design a Mamba-based module that models the time and frequency dimensions with linear complexity. Finally, compared to baselines, our model achieves an average 8.3-fold reduction in computational complexity while maintaining superior performance, and a 25-fold reduction in complexity compared to transformer-based models.
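As a rough illustration of the band-split idea mentioned above, the sketch below groups STFT frequency bins into sub-bands and projects each sub-band to a fixed-size embedding, compressing the frequency axis before any time/frequency sequence modeling. The band widths, embedding size, and module name are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Illustrative band-split: group STFT frequency bins into sub-bands and
    project each sub-band (real/imag parts) to a fixed-size embedding,
    compressing the frequency axis. Band widths below are made up."""
    def __init__(self, band_widths=(2, 4, 8, 16, 32), emb_dim=64):
        super().__init__()
        self.band_widths = band_widths
        # one linear projection per sub-band: (width * 2 real/imag) -> emb_dim
        self.proj = nn.ModuleList(
            [nn.Linear(w * 2, emb_dim) for w in band_widths]
        )

    def forward(self, spec):
        # spec: (batch, time, freq, 2) real/imag parts of the complex spectrum
        outs, start = [], 0
        for w, proj in zip(self.band_widths, self.proj):
            band = spec[:, :, start:start + w, :].flatten(2)  # (B, T, w*2)
            outs.append(proj(band))                            # (B, T, emb_dim)
            start += w
        return torch.stack(outs, dim=2)  # (B, T, n_bands, emb_dim)

x = torch.randn(1, 100, sum((2, 4, 8, 16, 32)), 2)
print(BandSplit()(x).shape)  # torch.Size([1, 100, 5, 64])
```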
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions, such as instruction following, becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models that take input and produce output in any modality, also known as any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains unexplored. Finally, there is no systematic framework for evaluating the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes 200k meticulously annotated all-modality human preference data points. We then introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework, eval-anything. All data, models, and code frameworks have been open-sourced for the community. For more details, please refer to https://github.com/PKU-Alignment/align-anything.
Abstract: Peer review, a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which offers limited assistance and leads to a more time-consuming review cycle. If authors can pinpoint the specific weaknesses in their paper, they can not only address the reviewers' concerns but also improve their work. This raises the critical question of how to enhance authors' comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: a semantic mind graph, which captures the author's thought process, and a hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations of the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors.
Abstract: Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, such as source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and by streaming target data. To tackle this new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way: learning a perturbation generation function with a latent input variable. For model instantiation, we leverage a Variational AutoEncoder to express this function; the encoder, with the reparameterization trick, predicts the latent input, while the decoder is responsible for generation. Furthermore, the saliency map is selected as the pseudo-perturbation label, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch sizes.
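To make the generative reformulation concrete, here is a minimal PyTorch sketch of a VAE whose encoder predicts the latent input via the reparameterization trick and whose decoder generates a perturbation; during training the output would be regressed toward the saliency map used as the pseudo-perturbation label. All layer sizes and names are hypothetical and not taken from GUES.

```python
import torch
import torch.nn as nn

class PerturbationVAE(nn.Module):
    """Minimal sketch of the data-centric idea: the encoder predicts a latent
    variable via the reparameterization trick and the decoder generates a
    perturbation for the input image (layer sizes are illustrative)."""
    def __init__(self, in_ch=3, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        delta = self.decoder(z)  # generated perturbation (fixed 32x32 here; would
                                 # be resized to the image resolution in practice)
        return delta, mu, logvar

# Training would regress delta toward a saliency map used as a pseudo-label,
# plus the usual KL term on (mu, logvar).
```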
Abstract: Purpose: To develop an MRI technique for free-breathing 3D whole-liver quantification of water T1, water T2, proton density fat fraction (PDFF), and R2*. Methods: An eight-echo spoiled gradient echo pulse sequence with spiral readout was developed by interleaving inversion recovery and T2 magnetization preparation. We propose a neural network based on a 4D and a 3D implicit neural representation (INR), which simultaneously learn the motion deformation fields and the static reference-frame MRI subspace images, respectively. Water and fat singular images were separated during network training, with no need for retrospective water-fat separation. T1, T2, R2*, and PDFF maps produced by the proposed method were validated in vivo on 10 healthy subjects, using quantitative maps generated from conventional scans as reference. Results: Our results showed minimal bias and narrow 95% limits of agreement for T1, T2, R2*, and PDFF values in the liver compared to conventional breath-holding scans. Conclusions: INR-MRF enabled co-registered 3D whole-liver T1, T2, R2*, and PDFF mapping in a single free-breathing scan.
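For readers unfamiliar with implicit neural representations, the toy sketch below shows the general form of a coordinate-based network that maps a normalized space-time coordinate (x, y, z, t) to a 3D displacement, in the spirit of the 4D INR used for motion deformation fields; the Fourier-feature encoding, layer widths, and class name are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DeformationINR(nn.Module):
    """Toy 4D implicit neural representation: maps a space-time coordinate
    (x, y, z, t) to a 3D displacement vector. Frequencies and widths are
    illustrative, not the paper's configuration."""
    def __init__(self, n_freq=8, hidden=256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freq) * torch.pi)
        in_dim = 4 * n_freq * 2  # sin and cos per coordinate per frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # displacement (dx, dy, dz)
        )

    def forward(self, coords):
        # coords: (N, 4) normalized (x, y, z, t)
        proj = coords.unsqueeze(-1) * self.freqs                   # (N, 4, n_freq)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)
        return self.mlp(enc)

coords = torch.rand(1024, 4)
print(DeformationINR()(coords).shape)  # torch.Size([1024, 3])
```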
Abstract: In chronic liver disease, liver fibrosis develops as excessive deposition of extracellular matrix macromolecules, predominantly collagens, progressively forms fibrous scars that disrupt the hepatic architecture; fibrosis, iron, and fat are interrelated. Fibrosis is the best predictor of morbidity and mortality in chronic liver disease, but liver biopsy, the reference method for diagnosis and staging, is invasive and limited by sampling error, interobserver variability, and the risk of complications. The overall objective of this study was to develop a new non-invasive method to quantify fibrosis using diamagnetic susceptibility sources, with histological validation in ex vivo liver explants.
Abstract: Graph clustering, a classical task in graph learning, involves partitioning the nodes of a graph into distinct clusters. This task has applications in various real-world scenarios, such as anomaly detection, social network analysis, and community discovery. Current graph clustering methods commonly rely on module pre-training to obtain a reliable prior distribution for the model, which is then used as the optimization objective. However, these methods often overlook deeper supervised signals, leading to sub-optimal reliability of the prior distribution. To address this issue, we propose a novel deep graph clustering method called CGCN. Our approach introduces contrastive signals and deep structural information into the pre-training process. Specifically, CGCN utilizes a contrastive learning mechanism to foster information interoperability among multiple modules and allows the model to adaptively adjust the degree of information aggregation for structures of different orders. We validate CGCN experimentally on multiple real-world graph datasets and show that it improves the reliability of the prior clustering distributions acquired through pre-training, leading to notable gains in clustering performance.
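As an illustration of the kind of contrastive signal that can be injected into pre-training, the snippet below computes a standard node-level InfoNCE loss between embeddings of two views of the same graph; the view construction and loss weighting follow common practice and are not necessarily CGCN's exact formulation.

```python
import torch
import torch.nn.functional as F

def node_contrastive_loss(z1, z2, tau=0.5):
    """Illustrative InfoNCE-style contrastive signal between two views of the
    same graph: node i in view 1 should match node i in view 2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # (N, N) cross-view similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # symmetric cross-entropy: each node's positive is its counterpart view
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)  # stand-ins for two GNN views
print(node_contrastive_loss(z1, z2).item())
```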
Abstract: Purpose: To develop a pipeline for motion artifact correction in mGRE and quantitative susceptibility mapping (QSM). Methods: Deep learning is integrated with autofocus to improve motion artifact suppression, and the approach is applied to QSM of patients with Parkinson's disease (PD). The estimation of affine motion parameters in the autofocus method depends on the signal-to-noise ratio and lacks accuracy when data sampling occurs outside the k-space center. A deep learning strategy is therefore employed to remove the residual motion artifacts left by autofocus. Results: Results obtained on simulated brain data (n = 15) with a reference ground truth show that the proposed autofocus deep learning method significantly improves the image quality of mGRE and QSM (p = 0.001 for SSIM, p < 0.0001 for PSNR and RMSE). QSM data from 10 PD patients with real motion artifacts were also corrected using the proposed method and reviewed by an experienced radiologist for image quality evaluation; the average image quality score increased (p = 0.0039). Conclusions: The proposed method enables substantial suppression of motion artifacts in mGRE and QSM.
Abstract: In-context learning (ICL), which performs inference with a few demonstrations, has become a widespread paradigm for stimulating LLM capabilities on downstream tasks. Due to context-length constraints, however, it cannot be further improved with more training data, and the general features extracted by LLMs in ICL are not adapted to the specific downstream task. In this paper, we propose a feature-adaptive and data-scalable in-context learning framework (FADS-ICL), which leverages task-adaptive features to promote inference on the downstream task, with supervision from beyond-context samples. Specifically, it first extracts general features of beyond-context samples one by one via the LLM using the ICL input format, and then introduces a task-specific modulator that performs feature refinement and prediction after being fit to the specific downstream task. We conduct extensive experiments on FADS-ICL under varying data settings (4$\sim$128 shots) and LLM scales (0.8$\sim$70B). Experimental results show that FADS-ICL consistently outperforms previous state-of-the-art methods by a significant margin under all settings, verifying its effectiveness and superiority. For example, under the 1.5B and 32-shot setting, FADS-ICL achieves \textbf{+14.3} average accuracy from feature adaptation over vanilla ICL on 10 datasets, and \textbf{+6.2} average accuracy over the previous state-of-the-art method, and the performance further improves with increasing training data. Code and data are publicly available at \url{https://github.com/jiahaozhenbang/FADS-ICL}.
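A minimal sketch of the "task-specific modulator" idea, assuming features have already been extracted one by one from a frozen LLM on ICL-formatted inputs: a small trainable head refines those features and is fit only on the beyond-context supervised samples. The hidden size, two-layer form, and variable names are illustrative, not FADS-ICL's exact design.

```python
import torch
import torch.nn as nn

class TaskModulator(nn.Module):
    """Lightweight modulator sketch: refines frozen LLM features and predicts
    labels; trained only on the beyond-context supervised samples."""
    def __init__(self, feat_dim, n_classes, hidden=256):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, feats):          # feats: (N, feat_dim) from the frozen LLM
        return self.head(self.refine(feats))

# Assume `features` were extracted one by one from the LLM on ICL-formatted
# inputs, and `labels` come from the beyond-context training samples.
features, labels = torch.randn(512, 4096), torch.randint(0, 5, (512,))
modulator = TaskModulator(feat_dim=4096, n_classes=5)
opt = torch.optim.AdamW(modulator.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(modulator(features), labels)
    loss.backward()
    opt.step()
```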
Abstract: We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch an entire data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.