Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chengjian Zheng

USM RNN-T model weights binarization

Jun 06, 2024

Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

Figure 1 for USM RNN-T model weights binarization

Figure 2 for USM RNN-T model weights binarization

Figure 3 for USM RNN-T model weights binarization

Figure 4 for USM RNN-T model weights binarization

Abstract:Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. Serving cost of large models is dominated by model size that is why model size reduction is an important research topic. In this work we are focused on model size reduction using weights only quantization. We present the weights binarization of USM Recurrent Neural Network Transducer (RNN-T) and show that its model size can be reduced by 15.9x times at cost of word error rate (WER) increase by only 1.9% in comparison to the float32 model. It makes it attractive for practical applications.

Via

Access Paper or Ask Questions

Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Jun 02, 2024

Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

Figure 1 for Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Figure 2 for Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Figure 3 for Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Figure 4 for Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

Abstract:Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction,because the real-world is actually video-based rather than a static state. In this paper, we adopt semi-supervised video semantic segmentation method based on unreliable pseudo labels. Then, We ensemble the teacher network model with the student network model to generate pseudo labels and retrain the student network. Our method achieves the mIoU scores of 63.71% and 67.83% on development test and final test respectively. Finally, we obtain the 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.

* Champion Solution for CVPR 2024 PVUW VSS Track. arXiv admin note: text overlap with arXiv:2306.02894

Via

Access Paper or Ask Questions

2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Jun 01, 2024

Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

Figure 1 for 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Figure 2 for 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Figure 3 for 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Figure 4 for 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Abstract:Video Panoptic Segmentation (VPS) is a challenging task that is extends from image panoptic segmentation.VPS aims to simultaneously classify, track, segment all objects in a video, including both things and stuff. Due to its wide application in many downstream tasks such as video understanding, video editing, and autonomous driving. In order to deal with the task of video panoptic segmentation in the wild, we propose a robust integrated video panoptic segmentation solution. We use DVIS++ framework as our baseline to generate the initial masks. Then,we add an additional image semantic segmentation model to further improve the performance of semantic classes.Finally, our method achieves state-of-the-art performance with a VPQ score of 56.36 and 57.12 in the development and test phases, respectively, and ultimately ranked 2nd in the VPS track of the PVUW Challenge at CVPR2024.

* 2nd Place Solution for CVPR 2024 PVUW VPS Track

Via

Access Paper or Ask Questions

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Apr 25, 2024

Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun(+65 more)

Figure 1 for Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Figure 2 for Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Figure 3 for Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Figure 4 for Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Abstract:This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.

* CVPR 2024, AI for Streaming (AIS) Workshop

Via

Access Paper or Ask Questions

Recyclable Semi-supervised Method Based on Multi-model Ensemble for Video Scene Parsing

Jun 05, 2023

Biao Wu, Shaoli Liu, Diankai Zhang, Chengjian Zheng, Si Gao, Xiaofeng Zhang, Ning Wang

Abstract:Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Since the real-world is actually video-based rather than a static state, learning to perform video semantic segmentation is more reasonable and practical for realistic applications. In this paper, we adopt Mask2Former as architecture and ViT-Adapter as backbone. Then, we propose a recyclable semi-supervised training method based on multi-model ensemble. Our method achieves the mIoU scores of 62.97% and 65.83% on Development test and final test respectively. Finally, we obtain the 2nd place in the Video Scene Parsing in the Wild Challenge at CVPR 2023.

Via

Access Paper or Ask Questions

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Nov 07, 2022

Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong(+43 more)

Figure 1 for Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Figure 2 for Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Figure 3 for Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Figure 4 for Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Abstract:Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.

* arXiv admin note: text overlap with arXiv:2105.08826, arXiv:2105.07809, arXiv:2211.04470, arXiv:2211.03885

Via

Access Paper or Ask Questions

Accelerating RNN-T Training and Inference Using CTC guidance

Oct 29, 2022

Yongqiang Wang, Zhehuai Chen, Chengjian Zheng, Yu Zhang, Wei Han, Parisa Haghani

Abstract:We propose a novel method to accelerate training and inference process of recurrent neural network transducer (RNN-T) based on the guidance from a co-trained connectionist temporal classification (CTC) model. We made a key assumption that if an encoder embedding frame is classified as a blank frame by the CTC model, it is likely that this frame will be aligned to blank for all the partial alignments or hypotheses in RNN-T and it can be discarded from the decoder input. We also show that this frame reduction operation can be applied in the middle of the encoder, which result in significant speed up for the training and inference in RNN-T. We further show that the CTC alignment, a by-product of the CTC decoder, can also be used to perform lattice reduction for RNN-T during training. Our method is evaluated on the Librispeech and SpeechStew tasks. We demonstrate that the proposed method is able to accelerate the RNN-T inference by 2.2 times with similar or slightly better word error rates (WER).

* submitted to ICASSP 2023

Via

Access Paper or Ask Questions

NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

Apr 25, 2022

Ren Yang, Radu Timofte, Meisong Zheng, Qunliang Xing, Minglang Qiao, Mai Xu, Lai Jiang, Huaida Liu, Ying Chen, Youcheng Ben(+68 more)

Figure 1 for NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

Figure 2 for NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

Figure 3 for NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

Figure 4 for NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

Abstract:This paper reviews the NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video. In this challenge, we proposed the LDV 2.0 dataset, which includes the LDV dataset (240 videos) and 95 additional videos. This challenge includes three tracks. Track 1 aims at enhancing the videos compressed by HEVC at a fixed QP. Track 2 and Track 3 target both the super-resolution and quality enhancement of HEVC compressed video. They require x2 and x4 super-resolution, respectively. The three tracks totally attract more than 600 registrations. In the test phase, 8 teams, 8 teams and 12 teams submitted the final results to Tracks 1, 2 and 3, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution and quality enhancement of compressed video. The proposed LDV 2.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge (including open-sourced codes) is at https://github.com/RenYang-home/NTIRE22_VEnh_SR.

Via

Access Paper or Ask Questions