Huiqi Li

LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

Mar 16, 2025

Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang

Abstract: Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in the adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in the later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our code will be released at https://github.com/feihongyan1/LazyMAR.

* 10 pages, 6 figures
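
The two redundancies named in the abstract can be illustrated with a minimal sketch. This is not the LazyMAR implementation: the cosine-similarity criterion, the `threshold` value, and all function names are assumptions made for illustration. The only formula taken as given is the standard classifier-free guidance combination `uncond + scale * (cond - uncond)`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reusable_tokens(prev_feats, curr_feats, threshold=0.95):
    """Token redundancy (illustrative): indices of tokens whose features
    barely changed between two adjacent decoding steps. Such tokens are
    candidates for reusing a cached representation. The threshold is a
    hypothetical value, not one reported by the paper."""
    return [
        i
        for i, (p, c) in enumerate(zip(prev_feats, curr_feats))
        if cosine(p, c) >= threshold
    ]


def cfg(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_with_cached_diff(cond, cached_diff, scale):
    """Condition redundancy (illustrative): if (cond - uncond) from the
    previous step is reused as `cached_diff`, the unconditional pass can be
    skipped, because uncond ~= cond - cached_diff gives
    cond + (scale - 1) * cached_diff."""
    return [c + (scale - 1.0) * d for c, d in zip(cond, cached_diff)]
```

When the cached difference equals the true difference, `cfg_with_cached_diff` reproduces `cfg` exactly; in practice the saving comes from accepting a small approximation error when the difference changes little between adjacent steps.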

RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports

Mar 12, 2025

Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li

Abstract: Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors. However, it is limited in its ability to cover a wider range of diseases. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms other compared LLMs on the bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.

ViLReF: A Chinese Vision-Language Retinal Foundation Model

Aug 20, 2024

Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, Ningli Wang

Abstract: Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: https://github.com/T6Yang/ViLReF.

RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

May 23, 2024

Jiawei Du, Jia Guo, Weihang Zhang, Shengzhu Yang, Hanruo Liu, Huiqi Li, Ningli Wang

Abstract: The Vision-Language Foundation model is increasingly investigated in the fields of computer vision and natural language processing, yet its exploration in ophthalmology and broader medical applications remains limited. The challenge is the lack of labeled data for training foundation models. To handle this issue, a CLIP-style retinal image foundation model is developed in this paper. Our foundation model, RET-CLIP, is specifically trained on a dataset of 193,865 patients to extract general features of color fundus photographs (CFPs), employing a tripartite optimization strategy to focus on left eye, right eye, and patient level to reflect real-world clinical scenarios. Extensive experiments demonstrate that RET-CLIP outperforms existing benchmarks across eight diverse datasets spanning four critical diagnostic categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and multi-label classification of multiple diseases, demonstrating the performance and generality of our foundation model. The source code and pre-trained model are available at https://github.com/sStonemason/RET-CLIP.

Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

May 23, 2024

Jia Guo, Shuai Lu, Weihang Zhang, Huiqi Li

Abstract: Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images, serving as an alternative to the conventional one-class-one-model setup. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting only of Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across three popular anomaly detection benchmarks including MVTec-AD, VisA, and the recently released Real-IAD. Our proposed Dinomaly achieves impressive image AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also surpasses the most advanced class-separated UAD records.

Absolute-Unified Multi-Class Anomaly Detection via Class-Agnostic Distribution Alignment

Mar 31, 2024

Jia Guo, Shuai Lu, Weihang Zhang, Huiqi Li

Abstract: Conventional unsupervised anomaly detection (UAD) methods build separate models for each object category. Recent studies have proposed to train a unified model for multiple classes, namely model-unified UAD. However, such methods still implement the unified model separately on each class during inference with respective anomaly decision thresholds, which hinders their application when the image categories are entirely unavailable. In this work, we present a simple yet powerful method to address multi-class anomaly detection without any class information, namely absolute-unified UAD. We target the crux of prior works in this challenging setting: different objects have mismatched anomaly score distributions. We propose Class-Agnostic Distribution Alignment (CADA) to align the mismatched score distribution of each implicit class without knowing class information, which enables unified anomaly detection for all classes and samples. The essence of CADA is to predict each class's score distribution of normal samples given any image, normal or anomalous, of this class. As a general component, CADA can activate the potential of nearly all UAD methods under the absolute-unified setting. Our approach is extensively evaluated under the proposed setting on two popular UAD benchmark datasets, MVTec AD and VisA, where we exceed previous state-of-the-art by a large margin.
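
To see why mismatched score distributions prevent a single decision threshold, consider the following minimal sketch. It is not the CADA method: the abstract does not say how the predicted distribution is used, so the z-score standardization, the function name, and the numbers in the usage note are all assumptions chosen only to illustrate the idea of alignment.

```python
def align_score(raw_score, normal_mean, normal_std, eps=1e-8):
    """Standardize a raw anomaly score against the (predicted) distribution
    of normal-sample scores for the image's implicit class.

    `normal_mean` and `normal_std` are assumed to come from some predictor;
    here they are plain inputs. After alignment, one threshold can be shared
    across classes whose raw score ranges differ.
    """
    return (raw_score - normal_mean) / (normal_std + eps)
```

With hypothetical values, a raw score of 0.5 from a class whose normal scores have mean 0.2 and standard deviation 0.05 aligns to about 6, while a raw score of 0.65 from a class with mean 0.6 and standard deviation 0.1 aligns to about 0.5. A single raw threshold of 0.55 would get both cases wrong, whereas a shared threshold of 3 on the aligned scores separates them.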

ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction

Jun 05, 2023

Jia Guo, Shuai Lu, Lize Jia, Weihang Zhang, Huiqi Li

Abstract: Most advanced unsupervised anomaly detection (UAD) methods rely on modeling feature representations of frozen encoder networks pre-trained on large-scale datasets, e.g., ImageNet. However, the features extracted from the encoders that are borrowed from natural image domains coincide little with the features required in the target UAD domain, such as industrial inspection and medical imaging. In this paper, we propose a novel epistemic UAD method, namely ReContrast, which optimizes the entire network to reduce biases towards the pre-trained image domain and orients the network in the target domain. We start with a feature reconstruction approach that detects anomalies from errors. Essentially, the elements of contrastive learning are elegantly embedded in feature reconstruction to prevent the network from training instability, pattern collapse, and identical shortcut, while simultaneously optimizing both the encoder and decoder on the target domain. To demonstrate our transfer ability on various image domains, we conduct extensive experiments across two popular industrial defect detection benchmarks and three medical image UAD tasks, demonstrating superiority over current state-of-the-art methods.

* under review

Efficient automatic segmentation for multi-level pulmonary arteries: The PARSE challenge

Apr 07, 2023

Gongning Luo, Kuanquan Wang, Jun Liu, Shuo Li, Xinjie Liang, Xiangyu Li, Shaowei Gan, Wei Wang, Suyu Dong, Wenyi Wang(+20 more)

Abstract: Efficient automatic segmentation of multi-level (i.e., main and branch) pulmonary arteries (PA) in CTPA images plays a significant role in clinical applications. However, most existing methods concentrate only on main PA or branch PA segmentation separately and ignore segmentation efficiency. Besides, there is no public large-scale dataset focused on PA segmentation, which makes it highly challenging to compare the different methods. To benchmark multi-level PA segmentation algorithms, we organized the first Pulmonary ARtery SEgmentation (PARSE) challenge. On the one hand, we focus on both the main PA and the branch PA segmentation. On the other hand, for better clinical application, we assign the same score weight to segmentation efficiency (mainly running time and GPU memory consumption during inference) while ensuring PA segmentation accuracy. We present a summary of the top algorithms and offer some suggestions for efficient and accurate multi-level PA automatic segmentation. We provide the PARSE challenge as open-access for the community to benchmark future algorithm developments at https://parse2022.grand-challenge.org/Parse2022/.

GAMMA Challenge: Glaucoma grAding from Multi-Modality imAges

Feb 16, 2022

Junde Wu, Huihui Fang, Fei Li, Huazhu Fu, Fengbin Lin, Jiongcheng Li, Lexing Huang, Qinji Yu, Sifan Song, Xingxing Xu(+19 more)

Abstract: Color fundus photography and Optical Coherence Tomography (OCT) are the two most cost-effective tools for glaucoma screening. Both imaging modalities have prominent biomarkers that indicate suspected glaucoma. Clinically, it is often recommended to take both of the screenings for a more accurate and reliable diagnosis. However, although numerous algorithms are proposed based on fundus images or OCT volumes in computer-aided diagnosis, there are still few methods leveraging both of the modalities for the glaucoma assessment. Inspired by the success of Retinal Fundus Glaucoma Challenge (REFUGE) we held previously, we set up the Glaucoma grAding from Multi-Modality imAges (GAMMA) Challenge to encourage the development of fundus & OCT-based glaucoma grading. The primary task of the challenge is to grade glaucoma from both the 2D fundus images and 3D OCT scanning volumes. As part of GAMMA, we have publicly released a glaucoma annotated dataset with both 2D fundus color photography and 3D OCT volumes, which is the first multi-modality dataset for glaucoma grading. In addition, an evaluation framework is also established to evaluate the performance of the submitted methods. During the challenge, 1272 results were submitted, and finally, the top 10 teams were selected for the final stage. We analyze their results and summarize their methods in the paper. Since all these teams submitted their source code in the challenge, a detailed ablation study is also conducted to verify the effectiveness of the particular modules proposed. We find many of the proposed techniques are practical for the clinical diagnosis of glaucoma. As the first in-depth study of fundus & OCT multi-modality glaucoma grading, we believe the GAMMA Challenge will be an essential starting point for future research.

Calcaneus Radiograph Analysis System: Rotation-Invariant Landmark Detection, Calcaneal Angle Measurement and Fracture Identification

Feb 05, 2020

Jia Guo, Wei Wang, Huanxin Yan, Junxian Chen, Hailin Xu, Huiqi Li

Abstract: The calcaneus is the largest tarsal bone and withstands the daily stresses of weight bearing. The calcaneal fracture is the most common type of tarsal bone fracture. After a fracture is suspected, plain radiographs should be taken first. Bohler's Angle (BA) and Critical Angle of Gissane (CAG), measured by four anatomic landmarks in the lateral foot radiograph, can aid operative restoration of the fractured calcaneus and fracture diagnosis and assessment. The aim of this study is to develop a system to automatically locate four anatomic landmarks and measure BA and CAG for fracture assessment. To solve the problem of variable rotation of the calcaneus, we proposed a coarse-to-fine Rotation-Invariant Regression-Voting (RIRV) landmark detection method based on Support Vector Regression (SVR) and the Scale Invariant Feature Transform (SIFT) patch descriptor. By implementing a novel normalization approach to convert displacements into coordinates of oriented feature patches, our method is explicitly rotation-invariant, in contrast to traditional regression methods. A multi-stream CNN structure with multi-region input is designed to screen for calcaneus fracture. The input ROIs of the multi-stream CNN are normalized by detected landmarks to uniform view, orientation and scale. The advantage of our approach is the use of landmarks as prior knowledge to normalize the inputs of the CNN so as to improve its efficiency. Experiments show that our CNN can accurately identify the fractures with sensitivity of 95.21% and specificity of 95.32%.

* 17 pages, 10 figures. Submitted to Artificial Intelligence in Medicine. Under review
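
Measuring an angle such as BA or CAG from detected landmarks reduces to finding the angle between two lines in the image plane. The sketch below shows that geometric step only. Which landmark pairs define each line is not stated in the abstract, so the function takes two generic point pairs, and its name and signature are hypothetical.

```python
import math


def angle_between_lines(p1, p2, q1, q2):
    """Angle in degrees (0 to 180) between the direction p1->p2 and the
    direction q1->q2, where each point is an (x, y) pair.

    atan2 of the cross-product magnitude and the dot product is used because
    it stays numerically stable for nearly parallel lines, unlike acos.
    """
    vx, vy = p2[0] - p1[0], p2[1] - p1[1]
    wx, wy = q2[0] - q1[0], q2[1] - q1[1]
    cross = vx * wy - vy * wx
    dot = vx * wx + vy * wy
    return math.degrees(math.atan2(abs(cross), dot))
```

Because the result depends only on the relative direction of the two lines, it does not change when the whole radiograph is rotated, so the measured angle itself is independent of how the foot is oriented in the image.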