Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jing Hao

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Dec 18, 2024

Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang

Figure 1 for Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Figure 2 for Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Figure 3 for Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Figure 4 for Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Abstract:Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods either distill the caption from the LMM models or construct the captions from the internet images or by human. We propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption. Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object-interaction (HOI)), and combine the attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists are easily combined into the pipeline. The complete source code of DCE pipeline and datasets will be available at \url{https://github.com/syp2ysy/DCE}.

* An open-source data engine for generating detailed image captions

Via

Access Paper or Ask Questions

T-Mamba: Frequency-Enhanced Gated Long-Range Dependency for Tooth 3D CBCT Segmentation

Apr 01, 2024

Jing Hao, Lei He, Kuo Feng Hung

Abstract:Efficient tooth segmentation in three-dimensional (3D) imaging, critical for orthodontic diagnosis, remains challenging due to noise, low contrast, and artifacts in CBCT images. Both convolutional Neural Networks (CNNs) and transformers have emerged as popular architectures for image segmentation. However, their efficacy in handling long-range dependencies is limited due to inherent locality or computational complexity. To address this issue, we propose T-Mamba, integrating shared positional encoding and frequency-based features into vision mamba, to address limitations in spatial position preservation and feature enhancement in frequency domain. Besides, we also design a gate selection unit to integrate two features in spatial domain and one feature in frequency domain adaptively. T-Mamba is the first work to introduce frequency-based features into vision mamba. Extensive experiments demonstrate that T-Mamba achieves new SOTA results on the public Tooth CBCT dataset and outperforms previous SOTA methods by a large margin, i.e., IoU + 3.63%, SO + 2.43%, DSC +2.30%, HD -4.39mm, and ASSD -0.37mm. The code and models are publicly available at https://github.com/isbrycee/T-Mamba.

Via

Access Paper or Ask Questions

GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis

Jan 27, 2024

Jing Hao, Moyun Liu, Kuo Feng Hung

Abstract:Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. These transparent glasses share the visual appearance of both transmitted arbitrary background scenes and reflected objects, thus having no fixed patterns.Recent visual foundation models, which are trained on vast amounts of data, have manifested stunning performance in terms of image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion.Specifically, we devise a simple glass surface segmentor named GEM, which only consists of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection can adaptively identify glass surface features, assigning them as initialized queries in the mask decoder. We also propose a Synthetic but photorealistic large-scale Glass Surface Detection dataset dubbed S-GSD via diffusion model with four different scales, which contain 1x, 5x, 10x, and 20x of the original real data size. This dataset is a feasible source for transfer learning. The scale of synthetic data has positive impacts on transfer learning, while the improvement will gradually saturate as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state-of-the-art on the GSD-S validation set (IoU +2.1%). Codes and datasets are available at: https://github.com/isbrycee/GEM-Glass-Segmentor.

* 14 pages, 9 figures, 7 tables

Via

Access Paper or Ask Questions

Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Sep 18, 2023

Ziming Wang, Shumin Han, Xiaodi Wang, Jing Hao, Xianbin Cao, Baochang Zhang

Figure 1 for Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Figure 2 for Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Figure 3 for Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Figure 4 for Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Abstract:Small CNN-based models usually require transferring knowledge from a large model before they are deployed in computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. The reason is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, in the Imagenet 1K dataset, H-GKD improves the accuracy of Resnet50 (sparse) from 76.98% to 80.01%.

Via

Access Paper or Ask Questions

Simple parameter-free self-attention approximation

Jul 22, 2023

Yuwen Zhai, Jing Hao, Liang Gao, Xinyu Li, Yiping Gao, Shumin Han

Abstract:The hybrid model of self-attention and convolution is one of the methods to lighten ViT. The quadratic computational complexity of self-attention with respect to token length limits the efficiency of ViT on edge devices. We propose a self-attention approximation without training parameters, called SPSA, which captures global spatial features with linear complexity. To verify the effectiveness of SPSA combined with convolution, we conduct extensive experiments on image classification and object detection tasks.

Via

Access Paper or Ask Questions

A Stronger Stitching Algorithm for Fisheye Images based on Deblurring and Registration

Jul 22, 2023

Jing Hao, Jingming Xie, Jinyuan Zhang, Moyun Liu

Abstract:Fisheye lens, which is suitable for panoramic imaging, has the prominent advantage of a large field of view and low cost. However, the fisheye image has a severe geometric distortion which may interfere with the stage of image registration and stitching. Aiming to resolve this drawback, we devise a stronger stitching algorithm for fisheye images by combining the traditional image processing method with deep learning. In the stage of fisheye image correction, we propose the Attention-based Nonlinear Activation Free Network (ANAFNet) to deblur fisheye images corrected by Zhang calibration method. Specifically, ANAFNet adopts the classical single-stage U-shaped architecture based on convolutional neural networks with soft-attention technique and it can restore a sharp image from a blurred image effectively. In the part of image registration, we propose the ORB-FREAK-GMS (OFG), a comprehensive image matching algorithm, to improve the accuracy of image registration. Experimental results demonstrate that panoramic images of superior quality stitching by fisheye images can be obtained through our method.

* 6 pages, 5 figures

Via

Access Paper or Ask Questions

Language-aware Multiple Datasets Detection Pretraining for DETRs

Apr 07, 2023

Jing Hao, Song Chen, Xiaodi Wang, Shumin Han

Figure 1 for Language-aware Multiple Datasets Detection Pretraining for DETRs

Figure 2 for Language-aware Multiple Datasets Detection Pretraining for DETRs

Figure 3 for Language-aware Multiple Datasets Detection Pretraining for DETRs

Figure 4 for Language-aware Multiple Datasets Detection Pretraining for DETRs

Abstract:Pretraining on large-scale datasets can boost the performance of object detectors while the annotated datasets for object detection are hard to scale up due to the high labor cost. What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly pretrain models across aggregation of datasets to enhance data volume and diversity. In this paper, we propose a strong framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR, without the need for manual label spaces integration. It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model. Specifically, we design a category extraction module for extracting potential categories involved in an image and assign these categories into different queries by language embeddings. Each query is only responsible for predicting a class-specific object. Besides, to adapt our novel detection paradigm, we propose a group bipartite matching strategy that limits the ground truths to match queries assigned to the same category. Extensive experiments demonstrate that METR achieves extraordinary results on either multi-task joint training or the pretrain & finetune paradigm. Notably, our pre-trained models have high flexible transferability and increase the performance upon various DETR-like detectors on COCO val2017 benchmark. Codes will be available after this paper is published.

Via

Access Paper or Ask Questions

A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images

Oct 18, 2021

Moyun Liu, Jingming Xie, Jing Hao, Yang Zhang, Xuzhan Chen, Youping Chen

Figure 1 for A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images

Figure 2 for A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images

Figure 3 for A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images

Figure 4 for A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images

Abstract:X-ray images are commonly used to ensure the security of devices in quality inspection industry. The recognition of signs printed on X-ray weld images plays an essential role in digital traceability system of manufacturing industry. However, the scales of objects vary different greatly in weld images, and it hinders us to achieve satisfactory recognition. In this paper, we propose a signs recognition framework based on convolutional neural networks (CNNs) for weld images. The proposed framework firstly contains a shallow classification network for correcting the pose of images. Moreover, we present a novel spatial and channel enhancement (SCE) module to address the above scale problem. This module can integrate multi-scale features and adaptively assign weights for each feature source. Based on SCE module, a narrow network is designed for final weld information recognition. To enhance the practicability of our framework, we carefully design the architecture of framework with a few parameters and computations. Experimental results show that our framework achieves 99.7% accuracy with 1.1 giga floating-point of operations (GFLOPs) on classification stage, and 90.0 mean average precision (mAP) with 176.1 frames per second (FPS) on recognition stage.

Via

Access Paper or Ask Questions

TransForensics: Image Forgery Localization with Dense Self-Attention

Aug 09, 2021

Jing Hao, Zhixin Zhang, Shicai Yang, Di Xie, Shiliang Pu

Figure 1 for TransForensics: Image Forgery Localization with Dense Self-Attention

Figure 2 for TransForensics: Image Forgery Localization with Dense Self-Attention

Figure 3 for TransForensics: Image Forgery Localization with Dense Self-Attention

Figure 4 for TransForensics: Image Forgery Localization with Dense Self-Attention

Abstract:Nowadays advanced image editing tools and technical skills produce tampered images more realistically, which can easily evade image forensic systems and make authenticity verification of images more difficult. To tackle this challenging problem, we introduce TransForensics, a novel image forgery localization method inspired by Transformers. The two major components in our framework are dense self-attention encoders and dense correction modules. The former is to model global context and all pairwise interactions between local patches at different scales, while the latter is used for improving the transparency of the hidden layers and correcting the outputs from different branches. Compared to previous traditional and deep learning methods, TransForensics not only can capture discriminative representations and obtain high-quality mask predictions but is also not limited by tampering types and patch sequence orders. By conducting experiments on main benchmarks, we show that TransForensics outperforms the stateof-the-art methods by a large margin.

* accepted by ICCV2021

Via

Access Paper or Ask Questions

An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Nov 21, 2019

Jiaxu Chen, Jing Hao, Kai Chen, Di Xie, Shicai Yang, Shiliang Pu

Figure 1 for An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Figure 2 for An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Figure 3 for An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Figure 4 for An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Abstract:Audio classification can distinguish different kinds of sounds, which is helpful for intelligent applications in daily life. However, it remains a challenging task since the sound events in an audio clip is probably multiple, even overlapping. This paper introduces an end-to-end audio classification system based on raw waveforms and mix-training strategy. Compared to human-designed features which have been widely used in existing research, raw waveforms contain more complete information and are more appropriate for multi-label classification. Taking raw waveforms as input, our network consists of two variants of ResNet structure which can learn a discriminative representation. To explore the information in intermediate layers, a multi-level prediction with attention structure is applied in our model. Furthermore, we design a mix-training strategy to break the performance limitation caused by the amount of training data. Experiments show that the mean average precision of the proposed audio classification system on Audio Set dataset is 37.2%. Without using extra training data, our system exceeds the state-of-the-art multi-level attention model.

* InterSpeech 2019

Via

Access Paper or Ask Questions