Lin Wu

Referring to Any Person

Mar 11, 2025

Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang

Abstract: Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek

Refining CNN-based Heatmap Regression with Gradient-based Corner Points for Electrode Localization

Dec 24, 2024

Lin Wu

Abstract: We propose a method for detecting the electrode positions in lithium-ion batteries. The process begins by identifying the region of interest (ROI) in the battery's X-ray image through corner point detection. A convolutional neural network is then used to regress the pole positions within this ROI. Finally, the regressed positions are optimized and corrected using corner point priors, significantly mitigating the loss of localization accuracy caused by operations such as feature map down-sampling and padding during network training. Our findings show that combining traditional pixel gradient analysis with CNN-based heatmap regression for keypoint extraction enhances both accuracy and efficiency, resulting in significant performance improvements.

UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

Nov 13, 2024

Chengyuan Zhang, Yilin Zhang, Lei Zhu, Deyin Liu, Lin Wu, Bo Li, Shichao Zhang, Mohammed Bennamoun, Farid Boussaid

Abstract: This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, we extend Mask-DINO into a two-stage incremental learning framework. Stage 1 focuses on optimizing the model using the base dataset, while Stage 2 involves fine-tuning the model on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates over-fitting when learning novel classes. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches.

* 11 pages, 3 figures
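
The abstract above does not say which form of knowledge distillation UIFormer uses. As a rough illustration of the general idea only, the sketch below computes a generic temperature-softened distillation loss (KL divergence between teacher and student class distributions). The function names and the temperature value are assumptions made for this example, not details from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: a generic distillation loss, not the loss
    used by UIFormer, whose exact form the abstract does not give.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return (temperature ** 2) * kl


# The frozen base-class model plays the teacher; the fine-tuned model is the student.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions drift apart, which is the property that discourages forgetting of base classes during fine-tuning.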

Medical AI for Early Detection of Lung Cancer: A Survey

Oct 18, 2024

Guohui Cai, Ying Cai, Zeyu Zhang, Yuanzhouhan Cao, Lin Wu, Daji Ergu, Zhinbin Liao, Yang Zhao

Abstract: Lung cancer remains one of the leading causes of morbidity and mortality worldwide, making early diagnosis critical for improving therapeutic outcomes and patient prognosis. Computer-aided diagnosis (CAD) systems, which analyze CT images, have proven effective in detecting and classifying pulmonary nodules, significantly enhancing the detection rate of early-stage lung cancer. Although traditional machine learning algorithms have been valuable, they exhibit limitations in handling complex sample data. The recent emergence of deep learning has revolutionized medical image analysis, driving substantial advancements in this field. This review focuses on recent progress in deep learning for pulmonary nodule detection, segmentation, and classification. Traditional machine learning methods, such as SVM and KNN, have shown limitations, paving the way for advanced approaches like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Generative Adversarial Networks (GAN). The integration of ensemble models and novel techniques is also discussed, emphasizing the latest developments in lung cancer diagnosis. Deep learning algorithms, combined with various analytical techniques, have markedly improved the accuracy and efficiency of pulmonary nodule analysis, surpassing traditional methods, particularly in nodule classification. Although challenges remain, continuous technological advancements are expected to further strengthen the role of deep learning in medical diagnostics, especially for early lung cancer detection and diagnosis. A comprehensive list of lung cancer detection models reviewed in this work is available at https://github.com/CaiGuoHui123/Awesome-Lung-Cancer-Detection

SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation

Aug 01, 2024

Shengbo Tan, Zeyu Zhang, Ying Cai, Daji Ergu, Lin Wu, Binbin Hu, Pengzhang Yu, Yang Zhao

Abstract: Medical imaging segmentation plays a significant role in the automatic recognition and analysis of lesions. State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation due to their superior performance in scalability and generalizability. However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity. To address these challenges, we introduce three key contributions. First, we propose SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks. Instead of taking whole 3D volumes as inputs, we adapt axial patches and customize patch-wise queries to ensure semantic consistency. Second, we conduct extensive experiments on the BTCV and ACDC datasets, achieving improvements of up to 11.48% and 6.71% respectively in mDSC, compared to state-of-the-art methods. Lastly, our proposed method demonstrates outstanding efficiency, reducing the number of parameters by 36.7% and the number of FLOPs by 10.7% compared to UNETR. This advancement holds promising potential for adapting our method to real-world clinical practice. The code will be available at https://github.com/goblin327/SegStitch
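
The mDSC figures quoted above refer to the mean Dice similarity coefficient. As a reminder of how that metric is usually computed, here is a minimal sketch over flattened label maps. The function names, the toy labels, and the convention of scoring 1.0 when a class is absent from both prediction and ground truth are assumptions for this example; the paper's evaluation code may differ.

```python
def dice(pred, target, label):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for one class label."""
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    overlap = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    denom = sum(pred_mask) + sum(target_mask)
    # Assumed convention: a class absent from both maps counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * overlap / denom


def mean_dice(pred, target, labels):
    """Average the per-class Dice scores over the listed foreground labels."""
    return sum(dice(pred, target, label) for label in labels) / len(labels)


# Toy flattened segmentation with background 0 and two organ labels 1 and 2.
predicted = [0, 1, 1, 2]
reference = [0, 1, 2, 2]
score = mean_dice(predicted, reference, labels=[1, 2])
```

In this toy case each class overlaps on one voxel out of three labelled voxels in total, so both per-class scores, and therefore the mean, equal 2/3.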

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Mar 10, 2024

Junhui Yin, Xinyu Zhang, Lin Wu, Xianghua Xie, Xiaojie Wang

Abstract: Existing pre-trained vision-language models, e.g., CLIP, have demonstrated impressive zero-shot generalization capabilities in various downstream tasks. However, the performance of these models degrades significantly when test inputs present different distributions. To this end, we explore the concept of test-time prompt tuning (TTPT), which enables the adaptation of the CLIP model to novel downstream tasks through only one step of optimization on an unsupervised objective that involves the test sample. Motivated by in-context learning within the field of natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks. InCPL involves associating a new test sample with very few or even just one labeled example as its in-context prompt. As a result, it can reliably estimate a label for the test sample, thereby facilitating the model adaptation process. InCPL first employs a token net to represent language descriptions as visual prompts that the vision encoder of a CLIP model can comprehend. Paired with in-context examples, we further propose a context-aware unsupervised loss to optimize test sample-aware visual prompts. This optimization allows a pre-trained, frozen CLIP model to be adapted to a test sample from any task using its learned adaptive prompt. Our method has demonstrated superior performance and achieved state-of-the-art results across various downstream datasets.

A Novel Approach to Industrial Defect Generation through Blended Latent Diffusion Model with Online Adaptation

Feb 29, 2024

Hanxi Li, Zhengxun Zhang, Hao Chen, Lin Wu, Bo Li, Deyin Liu, Mingwen Wang

Abstract: Effectively addressing the challenge of industrial Anomaly Detection (AD) requires an ample supply of defective samples, a requirement that is often hard to meet because such samples are scarce in industrial contexts. This paper introduces a novel algorithm designed to augment defective samples, thereby enhancing AD performance. The proposed method tailors the blended latent diffusion model for defect sample generation, employing a diffusion model to generate defective samples in the latent space. A feature editing process, controlled by a "trimap" mask and text prompts, refines the generated samples. The image generation inference process is structured into three stages: a free diffusion stage, an editing diffusion stage, and an online decoder adaptation stage. This sophisticated inference strategy yields high-quality synthetic defective samples with diverse pattern variations, leading to significantly improved AD accuracy based on the augmented training set. Specifically, on the widely recognized MVTec AD dataset, the proposed method elevates the state-of-the-art (SOTA) performance of AD with augmented data by 1.5%, 1.9%, and 3.1% for AD metrics AP, IAP, and IAP90, respectively. The implementation code of this work can be found at the GitHub repository https://github.com/GrandpaXun242/AdaBLDM.git

* 13 pages, 7 figures

DART: Depth-Enhanced Accurate and Real-Time Background Matting

Feb 24, 2024

Hanxi Li, Guofeng Li, Bo Li, Lin Wu, Yan Cheng

Abstract: Matting with a static background, often referred to as "Background Matting" (BGM), has garnered significant attention within the computer vision community due to its pivotal role in various practical applications like webcasting and photo editing. Nevertheless, achieving highly accurate background matting remains a formidable challenge, primarily owing to the limitations inherent in conventional RGB images. These limitations manifest in the form of susceptibility to varying lighting conditions and unforeseen shadows. In this paper, we propose DART, which leverages the rich depth information provided by RGB-Depth (RGB-D) cameras to enhance background matting performance in real time. Firstly, we adapt the original RGB-based BGM algorithm to incorporate depth information. The resulting model's output undergoes refinement through Bayesian inference, incorporating a background depth prior. The posterior prediction is then translated into a "trimap," which is subsequently fed into a state-of-the-art matting algorithm to generate more precise alpha mattes. To ensure real-time matting capabilities, a critical requirement for many real-world applications, we distill the backbone of our model from a larger and more versatile BGM network. Our experiments demonstrate the superior performance of the proposed method. Moreover, thanks to the distillation operation, our method achieves a remarkable processing speed of 33 frames per second (fps) on a mid-range edge-computing device. This high efficiency underscores DART's immense potential for deployment in mobile applications.
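
The abstract above describes refining a per-pixel prediction with a background depth prior and then converting the posterior into a trimap, but it gives no formulas. The sketch below is one simple way such a step could look, assuming a two-class Bayes update and fixed thresholds. The fusion rule, the threshold values, the 0/128/255 encoding, and all function names are assumptions for this example, not DART's actual procedure.

```python
def fuse_with_prior(model_fg_prob, prior_fg_prob):
    """Two-class Bayes update combining a model score with a depth-based prior.

    Hypothetical rule: treats the model output as evidence for foreground
    and renormalizes against the background hypothesis.
    """
    fg = model_fg_prob * prior_fg_prob
    bg = (1.0 - model_fg_prob) * (1.0 - prior_fg_prob)
    if fg + bg == 0.0:
        return 0.5  # the two sources fully contradict each other: stay undecided
    return fg / (fg + bg)


def build_trimap(model_probs, depth_priors, low=0.2, high=0.8):
    """Label each pixel as foreground (255), background (0), or unknown (128)."""
    trimap = []
    for model_prob, prior in zip(model_probs, depth_priors):
        posterior = fuse_with_prior(model_prob, prior)
        if posterior >= high:
            trimap.append(255)
        elif posterior <= low:
            trimap.append(0)
        else:
            trimap.append(128)
    return trimap


# Three pixels: confident foreground, ambiguous, and likely background.
trimap = build_trimap([0.9, 0.5, 0.2], [0.5, 0.5, 0.1])
```

Pixels labelled unknown are the ones a downstream matting algorithm would resolve into fractional alpha values, so the thresholds control how much work is handed to that stage.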

Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval

Oct 23, 2023

Xu Yuan, Zheng Zhang, Xunguang Wang, Lin Wu

Abstract: Deep hashing has been intensively studied and successfully applied in large-scale image retrieval systems due to its efficiency and effectiveness. Recent studies have recognized that the existence of adversarial examples poses a security threat to deep hashing models, that is, adversarial vulnerability. Notably, it is challenging to efficiently distill reliable semantic representatives for deep hashing to guide adversarial learning, which hinders the enhancement of adversarial robustness of deep hashing-based retrieval models. Moreover, current research on adversarial training for deep hashing is hard to formalize into a unified minimax structure. In this paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the adversarial robustness of deep hashing models. Specifically, we conceive a discriminative mainstay features learning (DMFL) scheme to construct semantic representatives for guiding adversarial learning in deep hashing. In particular, our DMFL, which comes with a strict theoretical guarantee, is adaptively optimized in a discriminative learning manner, where both discriminative and semantic properties are jointly considered. Moreover, adversarial examples are fabricated by maximizing the Hamming distance between the hash codes of adversarial samples and mainstay features, the efficacy of which is validated in the adversarial attack trials. Further, for the first time, we formulate adversarial training of deep hashing as a unified minimax optimization under the guidance of the generated mainstay codes. Extensive experiments on benchmark datasets show superb attack performance against the state-of-the-art algorithms; meanwhile, the proposed adversarial training can effectively eliminate adversarial perturbations for trustworthy deep hashing-based retrieval. Our code is available at https://github.com/xandery-geek/SAAT.

* in IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4681-4694, 2023
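
The attack described above maximizes the Hamming distance between hash codes. For codes with entries in {-1, +1}, that distance can be computed either by counting mismatches or from the inner product, via d = (K - <a, b>) / 2 for code length K. The sketch below checks both forms on a toy pair; the function names and example codes are assumptions for illustration and are not taken from the SAAT repository.

```python
def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length hash codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def hamming_from_inner_product(code_a, code_b):
    """Hamming distance of {-1, +1} codes from their inner product.

    Matching bits contribute +1 and differing bits -1, so the inner
    product equals K - 2d, which gives d = (K - inner) / 2.
    """
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    inner = sum(a * b for a, b in zip(code_a, code_b))
    return (len(code_a) - inner) // 2


# Toy 4-bit codes that differ in the second and third positions.
clean_code = [1, -1, 1, 1]
adversarial_code = [1, 1, -1, 1]
distance = hamming_distance(clean_code, adversarial_code)
```

The inner-product form matters in practice because it is differentiable once the sign function is relaxed, which is what lets a gradient-based attack push the distance up.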

DSformer: A Double Sampling Transformer for Multivariate Time Series Long-term Prediction

Aug 07, 2023

Chengqing Yu, Fei Wang, Zezhi Shao, Tao Sun, Lin Wu, Yongjun Xu

Abstract: Multivariate time series long-term prediction, which aims to predict how data will change over a long horizon, can provide references for decision-making. Although transformer-based models have made progress in this field, they usually do not make full use of three features of multivariate time series: global information, local information, and variable correlation. To effectively mine these three features and establish a high-precision prediction model, we propose a double sampling transformer (DSformer), which consists of the double sampling (DS) block and the temporal variable attention (TVA) block. Firstly, the DS block employs down sampling and piecewise sampling to transform the original series into feature vectors that focus on global information and local information respectively. Then, the TVA block uses temporal attention and variable attention to mine these feature vectors from different dimensions and extract key information. Finally, based on a parallel structure, DSformer uses multiple TVA blocks to mine and integrate different features obtained from DS blocks respectively. The integrated feature information is passed to the generative decoder based on a multi-layer perceptron to realize multivariate time series long-term prediction. Experimental results on nine real-world datasets show that DSformer can outperform eight existing baselines.

* Accepted by CIKM 2023 (FULL paper)
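
The two sampling schemes named in the abstract above can be illustrated on a single univariate series. In this sketch, down sampling takes every stride-th point so each subsequence spans the whole series (a global view), while piecewise sampling cuts the series into contiguous segments (a local view). This is one plausible reading of the abstract; the function names, the stride, and the divisibility requirement are assumptions for this example rather than details from the paper.

```python
def down_sample(series, stride):
    """Split a series into `stride` interleaved subsequences.

    Subsequence i holds elements i, i + stride, i + 2 * stride, ...,
    so each one covers the full time span at a coarser resolution.
    """
    if len(series) % stride != 0:
        raise ValueError("series length must be divisible by stride")
    return [series[offset::stride] for offset in range(stride)]


def piecewise_sample(series, num_pieces):
    """Split a series into `num_pieces` contiguous, equal-length segments."""
    if len(series) % num_pieces != 0:
        raise ValueError("series length must be divisible by num_pieces")
    piece_len = len(series) // num_pieces
    return [series[i * piece_len:(i + 1) * piece_len] for i in range(num_pieces)]


# A toy series of 12 time steps, sampled both ways into 3 subsequences.
series = list(range(12))
global_views = down_sample(series, stride=3)
local_views = piecewise_sample(series, num_pieces=3)
```

Here `global_views` is `[[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]` and `local_views` is `[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]`: the same 12 points arranged to emphasize long-range structure in one case and neighboring points in the other.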
