Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiangtao Wen

Backslash: Rate Constrained Optimized Training of Large Language Models

Apr 23, 2025

Jun Wu, Jiangtao Wen, Yuxing Han

Abstract:The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that Backslash can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices.

Via

Access Paper or Ask Questions

Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Apr 16, 2025

Jie Wang, Chen Ye Gan, Caoqi Wei, Jiangtao Wen, Yuxing Han

Abstract:Feature matching across video streams remains a cornerstone challenge in computer vision. Increasingly, robust multimodal matching has garnered interest in robotics, surveillance, remote sensing, and medical imaging. While traditional rely on detecting and matching spatial features, they break down when faced with noisy, misaligned, or cross-modal data. Recent deep learning methods have improved robustness through learned representations, but remain constrained by their dependence on extensive training data and computational demands. We present Flow Intelligence, a paradigm-shifting approach that moves beyond spatial features by focusing on temporal motion patterns exclusively. Instead of detecting traditional keypoints, our method extracts motion signatures from pixel blocks across consecutive frames and extract temporal motion signatures between videos. These motion-based descriptors achieve natural invariance to translation, rotation, and scale variations while remaining robust across different imaging modalities. This novel approach also requires no pretraining data, eliminates the need for spatial feature detection, enables cross-modal matching using only temporal motion, and it outperforms existing methods in challenging scenarios where traditional approaches fail. By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.

Via

Access Paper or Ask Questions

Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Jan 25, 2025

Fengpu Pan, Jiangtao Wen, Yuxing Han

Figure 1 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 2 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 3 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 4 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Abstract:Snapshot compressive imaging (SCI) is a promising technique for capturing high-speed video at low bandwidth and low power, typically by compressing multiple frames into a single measurement. However, similar to traditional CMOS image sensor based imaging systems, SCI also faces challenges in low-lighting photon-limited and low-signal-to-noise-ratio image conditions. In this paper, we propose a novel Compressive Denoising Autoencoder (CompDAE) using the STFormer architecture as the backbone, to explicitly model noise characteristics and provide computer vision functionalities such as edge detection and depth estimation directly from compressed sensing measurements, while accounting for realistic low-photon conditions. We evaluate the effectiveness of CompDAE across various datasets and demonstrated significant improvements in task performance compared to conventional RGB-based methods. In the case of ultra-low-lighting (APC $\leq$ 20) while conventional methods failed, the proposed algorithm can still maintain competitive performance.

Via

Access Paper or Ask Questions

Efficient Video Neural Network Processing Based on Motion Estimation

Jan 25, 2025

Haichao Wang, Jiangtao Wen, Yuxing Han

Abstract:Video neural network (VNN) processing using the conventional pipeline first converts Bayer video information into human understandable RGB videos using image signal processing (ISP) on a pixel by pixel basis. Then, VNN processing is performed on a frame by frame basis. Both ISP and VNN are computationally expensive with high power consumption and latency. In this paper, we propose an efficient VNN processing framework. Instead of using ISP, computer vision tasks are directly accomplished using Bayer pattern information. To accelerate VNN processing, motion estimation is introduced to find temporal redundancies in input video data so as to avoid repeated and unnecessary computations. Experiments show greater than 67\% computation reduction, while maintaining computer vision task accuracy for typical computer vision tasks and data sets.

Via

Access Paper or Ask Questions

Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

Mar 13, 2024

Yuxing Han, Yunan Ding, Chen Ye Gan, Jiangtao Wen

Figure 1 for Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

Figure 2 for Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

Figure 3 for Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

Figure 4 for Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

Abstract:Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially when an immense volume of video content is being constantly generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation in low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding. To validate our approach, we built a comprehensive data set comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our evaluations indicate precision, accuracy, and recall rates consistently above 80%, many exceeding 90%, and some reaching 99%. The algorithm operates approximately 15,000 times faster than real-time for 30fps videos, outperforming traditional Dynamic Time Warping (DTW) algorithm by seven orders of magnitude.

* 5 pages, 5 figures, 1 table. arXiv admin note: substantial text overlap with arXiv:2309.07361

Via

Access Paper or Ask Questions

Judging a video by its bitstream cover

Sep 14, 2023

Yuxing Han, Yunan Ding, Jiangtao Wen, Chen Ye Gan

Abstract:Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially in an age where an immense volume of video content is constantly being generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation in low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream. We validate our approach using a custom-built data set comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our preliminary evaluations indicate precision, accuracy, and recall rates well over 80%. The algorithm operates approximately 15,000 times faster than real-time for 30fps videos, outperforming traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.

Via

Access Paper or Ask Questions

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Mar 13, 2023

Yubin Hu, Yuze He, Yanghao Li, Jisheng Li, Yuxing Han, Jiangtao Wen, Yong-Jin Liu

Figure 1 for Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Figure 2 for Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Figure 3 for Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Figure 4 for Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Abstract:Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.

* CVPR 2023

Via

Access Paper or Ask Questions

Phocus: Picking Valuable Research from a Sea of Citations

Jan 14, 2022

Xinrong Zhang, Zihou Ren, Xi Li, Shuqi Liu, Yunlong Deng, Yadi Xiao, Yuxing Han, Jiangtao Wen

Figure 1 for Phocus: Picking Valuable Research from a Sea of Citations

Figure 2 for Phocus: Picking Valuable Research from a Sea of Citations

Figure 3 for Phocus: Picking Valuable Research from a Sea of Citations

Figure 4 for Phocus: Picking Valuable Research from a Sea of Citations

Abstract:The deluge of new papers has significantly blocked the development of academics, which is mainly caused by author-level and publication-level evaluation metrics that only focus on quantity. Those metrics have resulted in several severe problems that trouble scholars focusing on the important research direction for a long time and even promote an impetuous academic atmosphere. To solve those problems, we propose Phocus, a novel academic evaluation mechanism for authors and papers. Phocus analyzes the sentence containing a citation and its contexts to predict the sentiment towards the corresponding reference. Combining others factors, Phocus classifies citations coarsely, ranks all references within a paper, and utilizes the results of the classifier and the ranking model to get the local influential factor of a reference to the citing paper. The global influential factor of the reference to the citing paper is the product of the local influential factor and the total influential factor of the citing paper. Consequently, an author's academic influential factor is the sum of his contributions to each paper he co-authors.

Via

Access Paper or Ask Questions

Novel tile segmentation scheme for omnidirectional video

Mar 10, 2021

Jisheng Li, Ziyu Wen, Sihan Li, Yikai Zhao, Bichuan Guo, Jiangtao Wen

Figure 1 for Novel tile segmentation scheme for omnidirectional video

Figure 2 for Novel tile segmentation scheme for omnidirectional video

Figure 3 for Novel tile segmentation scheme for omnidirectional video

Figure 4 for Novel tile segmentation scheme for omnidirectional video

Abstract:Regular omnidirectional video encoding technics use map projection to flatten a scene from a spherical shape into one or several 2D shapes. Common projection methods including equirectangular and cubic projection have varying levels of interpolation that create a large number of non-information-carrying pixels that lead to wasted bitrate. In this paper, we propose a tile based omnidirectional video segmentation scheme which can save up to 28% of pixel area and 20% of BD-rate averagely compared to the traditional equirectangular projection based approach.

* Published in 2016 IEEE International Conference on Image Processing (ICIP)

Via

Access Paper or Ask Questions

Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

Mar 10, 2021

Jisheng Li, Qi Dai, Jiangtao Wen

Figure 1 for Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

Figure 2 for Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

Figure 3 for Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

Figure 4 for Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded Aperture

Abstract:Consistent in-focus input imagery is an essential precondition for machine vision systems to perceive the dynamic environment. A defocus blur severely degrades the performance of vision systems. To tackle this problem, we propose a deep-learning-based framework estimating the kernel scale and orientation of the defocus blur to adjust lens focus rapidly. Our pipeline utilizes 3D ConvNet for a variable number of input hypotheses to select the optimal slice from the input stack. We use random shuffle and Gumbel-softmax to improve network performance. We also propose to generate synthetic defocused images with various asymmetric coded apertures to facilitate training. Experiments are conducted to demonstrate the effectiveness of our framework.

Via

Access Paper or Ask Questions