Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhihao Chen

FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising

Aug 24, 2025

Zhihao Chen, Qi Gao, Zilong Li, Junping Zhang, Yi Zhang, Jun Zhao, Hongming Shan

Abstract:Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DACLIP into diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and the remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.

* 10 pages, 7 figures

Via

Access Paper or Ask Questions

LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models

Jul 08, 2025

Zhihao Chen, Tao Chen, Chenhui Wang, Qi Gao, Huidong Xie, Chuang Niu, Ge Wang, Hongming Shan

Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantic while capturing global features with efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, LangDA loss improves explainability by integrating language-guided insights into image reconstruction and offers a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available on https://github.com/hao1635/LangMamba.

* 11 pages, 8 figures

Via

Access Paper or Ask Questions

Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

May 05, 2025

Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian

Abstract:Recently, there is a high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear what the performance of DeepSeek-R1 and V3 will be after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization maintains little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms traditional Q3_K_M variant on various benchmarks, which is also comparable with 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3\_K\_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.

Via

Access Paper or Ask Questions

Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Mar 15, 2025

Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Zhihao Chen, Guangwei Gao

Abstract:In recent years, Transformer has witnessed significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. code is available at https://zs1314.github.io/Fraesormer.

* 6 pages, 4 figures

Via

Access Paper or Ask Questions

Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Feb 28, 2025

Tao Chen, Chenhui Wang, Zhihao Chen, Hongming Shan

Figure 1 for Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Figure 2 for Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Figure 3 for Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Figure 4 for Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Abstract:While deep learning has significantly advanced medical image segmentation, most existing methods still struggle with handling complex anatomical regions. Cascaded or deep supervision-based approaches attempt to address this challenge through multi-scale feature learning but fail to establish sufficient inter-scale dependencies, as each scale relies solely on the features of the immediate predecessor. To this end, we propose the AutoRegressive Segmentation framework via next-scale mask prediction, termed AR-Seg, which progressively predicts the next-scale mask by explicitly modeling dependencies across all previous scales within a unified architecture. AR-Seg introduces three innovations: (1) a multi-scale mask autoencoder that quantizes the mask into multi-scale token maps to capture hierarchical anatomical structures, (2) a next-scale autoregressive mechanism that progressively predicts next-scale masks to enable sufficient inter-scale dependencies, and (3) a consensus-aggregation strategy that combines multiple sampled results to generate a more accurate mask, further improving segmentation robustness. Extensive experimental results on two benchmark datasets with different modalities demonstrate that AR-Seg outperforms state-of-the-art methods while explicitly visualizing the intermediate coarse-to-fine segmentation process.

* 10 pages, 4 figures

Via

Access Paper or Ask Questions

Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

Dec 26, 2024

Yiyuan Ge, Zhihao Chen, Ziyang Wang, Jiaju Kang, Mingya Zhang

Figure 1 for Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

Figure 2 for Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

Figure 3 for Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

Figure 4 for Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

Abstract:The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images enabling 24-hour surveillance. Current studies relying on unsupervised modality transformations as well as inefficient embedding constraints to bridge the spectral differences between infrared and visible images, however, limit their potential performance. To tackle the limitations of the above approaches, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at https://github.com/1024AILab/ReID-SEPG.

Via

Access Paper or Ask Questions

DAPONet: A Dual Attention and Partially Overparameterized Network for Real-Time Road Damage Detection

Sep 03, 2024

Weichao Pan, Jiaju Kang, Xu Wang, Zhihao Chen, Yiyuan Ge

Abstract:Current road damage detection methods, relying on manual inspections or sensor-mounted vehicles, are inefficient, limited in coverage, and often inaccurate, especially for minor damages, leading to delays and safety hazards. To address these issues and enhance real-time road damage detection using street view image data (SVRDD), we propose DAPONet, a model incorporating three key modules: a dual attention mechanism combining global and local attention, a multi-scale partial over-parameterization module, and an efficient downsampling module. DAPONet achieves a mAP50 of 70.1% on the SVRDD dataset, outperforming YOLOv10n by 10.4%, while reducing parameters to 1.6M and FLOPs to 1.7G, representing reductions of 41% and 80%, respectively. On the MS COCO2017 val dataset, DAPONet achieves an mAP50-95 of 33.4%, 0.8% higher than EfficientDet-D1, with a 74% reduction in both parameters and FLOPs.

Via

Access Paper or Ask Questions

Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Jul 15, 2024

Zhihao Chen, Yiyuan Ge, Qing Yue

Figure 1 for Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Figure 2 for Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Figure 3 for Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Figure 4 for Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Abstract:Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference,which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.

* 2024 International Conference on Intelligent Computing

Via

Access Paper or Ask Questions

HiDiff: Hybrid Diffusion Framework for Medical Image Segmentation

Jul 03, 2024

Tao Chen, Chenhui Wang, Zhihao Chen, Yiming Lei, Hongming Shan

Figure 1 for HiDiff: Hybrid Diffusion Framework for Medical Image Segmentation

Figure 2 for HiDiff: Hybrid Diffusion Framework for Medical Image Segmentation

Abstract:Medical image segmentation has been significantly advanced with the rapid development of deep learning (DL) techniques. Existing DL-based segmentation models are typically discriminative; i.e., they aim to learn a mapping from the input image to segmentation masks. However, these discriminative methods neglect the underlying data distribution and intrinsic class characteristics, suffering from unstable feature space. In this work, we propose to complement discriminative segmentation methods with the knowledge of underlying data distribution from generative models. To that end, we propose a novel hybrid diffusion framework for medical image segmentation, termed HiDiff, which can synergize the strengths of existing discriminative segmentation models and new generative diffusion models. HiDiff comprises two key components: discriminative segmentor and diffusion refiner. First, we utilize any conventional trained segmentation models as discriminative segmentor, which can provide a segmentation mask prior for diffusion refiner. Second, we propose a novel binary Bernoulli diffusion model (BBDM) as the diffusion refiner, which can effectively, efficiently, and interactively refine the segmentation mask by modeling the underlying data distribution. Third, we train the segmentor and BBDM in an alternate-collaborative manner to mutually boost each other. Extensive experimental results on abdomen organ, brain tumor, polyps, and retinal vessels segmentation datasets, covering four widely-used modalities, demonstrate the superior performance of HiDiff over existing medical segmentation algorithms, including the state-of-the-art transformer- and diffusion-based ones. In addition, HiDiff excels at segmenting small objects and generalizing to new datasets. Source codes are made available at https://github.com/takimailto/HiDiff.

* Accepted by IEEE Transactions on Medical Imaging 2024

Via

Access Paper or Ask Questions

Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

May 25, 2024

Hongye Zeng, Ke Zou, Zhihao Chen, Rui Zheng, Huazhu Fu

Figure 1 for Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

Figure 2 for Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

Figure 3 for Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

Figure 4 for Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

Abstract:Source-Free Unsupervised Domain Adaptation (SFUDA) has recently become a focus in the medical image domain adaptation, as it only utilizes the source model and does not require annotated target data. However, current SFUDA approaches cannot tackle the complex segmentation task across different MRI sequences, such as the vestibular schwannoma segmentation. To address this problem, we proposed Reliable Source Approximation (RSA), which can generate source-like and structure-preserved images from the target domain for updating model parameters and adapting domain shifts. Specifically, RSA deploys a conditional diffusion model to generate multiple source-like images under the guidance of varying edges of one target image. An uncertainty estimation module is then introduced to predict and refine reliable pseudo labels of generated images, and the prediction consistency is developed to select the most reliable generations. Subsequently, all reliable generated images and their pseudo labels are utilized to update the model. Our RSA is validated on vestibular schwannoma segmentation across multi-modality MRI. The experimental results demonstrate that RSA consistently improves domain adaptation performance over other state-of-the-art SFUDA methods. Code is available at https://github.com/zenghy96/Reliable-Source-Approximation.

* Early accepted by MICCAI 2024

Via

Access Paper or Ask Questions