Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Feng Hong

Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Jul 24, 2025

Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao

Figure 1 for Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Figure 2 for Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Figure 3 for Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Figure 4 for Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Abstract:Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.

Via

Access Paper or Ask Questions

Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Jul 17, 2025

Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Figure 1 for Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Figure 2 for Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Figure 3 for Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Figure 4 for Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Abstract:The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection as an alternative efficient paradigm plays an important direction to accelerate the training process. However, recent advances on sample selection either mostly rely on an oracle model to offline select a high-quality coreset, which is limited in the cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered the noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: https://github.com/MediaBrain-SJTU/DISSect.

Via

Access Paper or Ask Questions

Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

May 06, 2025

Nikhil M. Pawar, Jorge A. Prozzi, Feng Hong, Surya Sarat Chandra Congress

Figure 1 for Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Figure 2 for Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Figure 3 for Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Figure 4 for Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Abstract:Recently, there has been an impetus for the application of cutting-edge data collection platforms such as drones mounted with camera sensors for infrastructure asset management. However, the sensor characteristics, proximity to the structure, hard-to-reach access, and environmental conditions often limit the resolution of the datasets. A few studies used super-resolution techniques to address the problem of low-resolution images. Nevertheless, these techniques were observed to increase computational cost and false alarms of distress detection due to the consideration of all the infrastructure images i.e., positive and negative distress classes. In order to address the pre-processing of false alarm and achieve efficient super-resolution, this study developed a framework consisting of convolutional neural network (CNN) and efficient sub-pixel convolutional neural network (ESPCNN). CNN accurately classified both the classes. ESPCNN, which is the lightweight super-resolution technique, generated high-resolution infrastructure image of positive distress obtained from CNN. The ESPCNN outperformed bicubic interpolation in all the evaluation metrics for super-resolution. Based on the performance metrics, the combination of CNN and ESPCNN was observed to be effective in preprocessing the infrastructure images with negative distress, reducing the computational cost and false alarms in the next step of super-resolution. The visual inspection showed that EPSCNN is able to capture crack propagation, complex geometry of even minor cracks. The proposed framework is expected to help the highway agencies in accurately performing distress detection and assist in efficient asset management practices.

* Presented :Transportation Research Board 104th Annual Meeting, Washington, D.C

Via

Access Paper or Ask Questions

Learning to Instruct for Visual Instruction Tuning

Mar 28, 2025

Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

Abstract:We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.

* 16 pages, 10 figures

Via

Access Paper or Ask Questions

Diversified Batch Selection for Training Acceleration

Jun 07, 2024

Feng Hong, Yueming Lyu, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang

Abstract:The remarkable success of modern machine learning models on large datasets often demands extensive training time and resource consumption. To save cost, a prevalent research line, known as online batch selection, explores selecting informative subsets during the training process. Although recent efforts achieve advancements by measuring the impact of each sample on generalization, their reliance on additional reference models inherently limits their practical applications, when there are no such ideal models available. On the other hand, the vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner, which sacrifices the diversity and induces the redundancy. To tackle this dilemma, we propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples. Specifically, we define a novel selection objective that measures the group-wise orthogonalized representativeness to combat the redundancy issue of previous sample-wise criteria, and provide a principled selection-efficient realization. Extensive experiments across various tasks demonstrate the significant superiority of DivBS in the performance-speedup trade-off. The code is publicly available.

* ICML 2024

Via

Access Paper or Ask Questions

A framework of text-dependent speaker verification for chinese numerical string corpus

May 11, 2024

Litong Zheng, Feng Hong, Weijie Xu, Wan Zheng

Figure 1 for A framework of text-dependent speaker verification for chinese numerical string corpus

Figure 2 for A framework of text-dependent speaker verification for chinese numerical string corpus

Figure 3 for A framework of text-dependent speaker verification for chinese numerical string corpus

Figure 4 for A framework of text-dependent speaker verification for chinese numerical string corpus

Abstract:The Chinese numerical string corpus, serves as a valuable resource for speaker verification, particularly in financial transactions. Researches indicate that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, that can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss including text classification loss, connectionist temporal classification (CTC) loss and decoder loss; while in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.

* arXiv admin note: text overlap with arXiv:2312.01645

Via

Access Paper or Ask Questions

UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Dec 18, 2023

Tianjie Dai, Ruipeng Zhang, Feng Hong, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Figure 1 for UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Figure 2 for UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Figure 3 for UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Figure 4 for UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Abstract:Vision-Language Pre-training (VLP) that utilizes the multi-modal information to promote the training efficiency and effectiveness, has achieved great success in vision recognition of natural domains and shown promise in medical imaging diagnosis for the Chest X-Rays (CXRs). However, current works mainly pay attention to the exploration on single dataset of CXRs, which locks the potential of this powerful paradigm on larger hybrid of multi-source CXRs datasets. We identify that although blending samples from the diverse sources offers the advantages to improve the model generalization, it is still challenging to maintain the consistent superiority for the task of each source due to the existing heterogeneity among sources. To handle this dilemma, we design a Conquer-and-Divide pre-training framework, termed as UniChest, aiming to make full use of the collaboration benefit of multiple sources of CXRs while reducing the negative influence of the source heterogeneity. Specially, the ``Conquer" stage in UniChest encourages the model to sufficiently capture multi-source common patterns, and the ``Divide" stage helps squeeze personalized patterns into different small experts (query networks). We conduct thorough experiments on many benchmarks, e.g., ChestX-ray14, CheXpert, Vindr-CXR, Shenzhen, Open-I and SIIM-ACR Pneumothorax, verifying the effectiveness of UniChest over a range of baselines, and release our codes and pre-training models at https://github.com/Elfenreigen/UniChest.

Via

Access Paper or Ask Questions

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Dec 04, 2023

Litong Zheng, Feng Hong, Weijie Xu

Figure 1 for A text-dependent speaker verification application framework based on Chinese numerical string corpus

Figure 2 for A text-dependent speaker verification application framework based on Chinese numerical string corpus

Figure 3 for A text-dependent speaker verification application framework based on Chinese numerical string corpus

Figure 4 for A text-dependent speaker verification application framework based on Chinese numerical string corpus

Abstract:Researches indicate that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) in short speech scenarios. However, collecting large-scale fixed text speech data is challenging, and as speech length increases, factors like sentence rhythm and pauses affect TDSV's sensitivity to text sequence. Based on these factors, We propose the hypothesis that strategies such as more fine-grained pooling methods on time scales and decoupled representations of speech speaker embedding and text embedding are more suitable for TD-SV. We have introduced an end-to-end TD-SV system based on a dataset comprising longer Chinese numerical string texts. It contains a text embedding network, a speaker embedding network, and back-end fusion. First, we recorded a dataset consisting of long Chinese numerical text named SHAL, which is publicly available on the Open-SLR website. We addressed the issue of dataset scarcity by augmenting it using Tacotron2 and HiFi-GAN. Next, we introduced a dual representation of speech with text embedding and speaker embedding. In the text embedding network, we employed an enhanced Transformer and introduced a triple loss that includes text classification loss, CTC loss, and decoder loss. For the speaker embedding network, we enhanced a sliding window attentive statistics pooling (SWASP), combined with attentive statistics pooling (ASP) to create a multi-scale pooling method. Finally, we fused text embedding and speaker embedding. Our pooling methods achieved an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.

Via

Access Paper or Ask Questions

Combating Representation Learning Disparity with Geometric Harmonization

Oct 26, 2023

Zhihan Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Bo Han, Yanfeng Wang

Figure 1 for Combating Representation Learning Disparity with Geometric Harmonization

Figure 2 for Combating Representation Learning Disparity with Geometric Harmonization

Figure 3 for Combating Representation Learning Disparity with Geometric Harmonization

Figure 4 for Combating Representation Learning Disparity with Geometric Harmonization

Abstract:Self-supervised learning (SSL) as an effective paradigm of representation learning has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representation. Conventional SSL methods, pursuing sample-level uniformity, easily leads to representation learning disparity where head classes dominate the feature regime but tail classes passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under long-tailed distribution. Specially, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infer an fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to the distribution skewness. Our code is available at https://github.com/MediaBrain-SJTU/Geometric-Harmonization.

* Accepted to NeurIPS 2023 (spotlight)

Via

Access Paper or Ask Questions

Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Oct 24, 2023

Gregory Holste, Yiliang Zhou, Song Wang, Ajay Jaiswal, Mingquan Lin, Sherry Zhuge, Yuzhe Yang, Dongkyun Kim, Trong-Hieu Nguyen-Mau, Minh-Triet Tran(+15 more)

Figure 1 for Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Figure 2 for Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Figure 3 for Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Figure 4 for Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Abstract:Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" $\unicode{x2013}$ there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.

Via

Access Paper or Ask Questions