Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fanyang Meng

Boosting Neural Video Representation via Online Structural Reparameterization

Nov 14, 2025

Ziyi Li, Qingyu Mao, Shuai Liu, Qilei Li, Fanyang Meng, Yongsheng Liang

Abstract:Neural Video Representation~(NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have made efforts in architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, a NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.

* The 8th Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2025)
* 15 pages, 7 figures

Via

Access Paper or Ask Questions

Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction

May 22, 2025

Jiacong Chen, Qingyu Mao, Youneng Bao, Xiandong Meng, Fanyang Meng, Ronggang Wang, Yongsheng Liang

Abstract:3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face challenge in prohibitive storage requirements primarily due to point-wise modeling that fails to exploit the motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework, leveraging the locality and consistency of motion in dynamic scene, that models object-consistent Gaussian point motion through keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159 X compared to 3DGStream and 14 X compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed. Our code will be released.

* 17 pages, 9 figures

Via

Access Paper or Ask Questions

Content-Distortion High-Order Interaction for Blind Image Quality Assessment

Apr 07, 2025

Shuai Liu, Qingyu Mao, Chao Li, Jiacong Chen, Fanyang Meng, Yonghong Tian, Yongsheng Liang

Figure 1 for Content-Distortion High-Order Interaction for Blind Image Quality Assessment

Figure 2 for Content-Distortion High-Order Interaction for Blind Image Quality Assessment

Figure 3 for Content-Distortion High-Order Interaction for Blind Image Quality Assessment

Figure 4 for Content-Distortion High-Order Interaction for Blind Image Quality Assessment

Abstract:The content and distortion are widely recognized as the two primary factors affecting the visual quality of an image. While existing No-Reference Image Quality Assessment (NR-IQA) methods have modeled these factors, they fail to capture the complex interactions between content and distortions. This shortfall impairs their ability to accurately perceive quality. To confront this, we analyze the key properties required for interaction modeling and propose a robust NR-IQA approach termed CoDI-IQA (Content-Distortion high-order Interaction for NR-IQA), which aggregates local distortion and global content features within a hierarchical interaction framework. Specifically, a Progressive Perception Interaction Module (PPIM) is proposed to explicitly simulate how content and distortions independently and jointly influence image quality. By integrating internal interaction, coarse interaction, and fine interaction, it achieves high-order interaction modeling that allows the model to properly represent the underlying interaction patterns. To ensure sufficient interaction, multiple PPIMs are employed to hierarchically fuse multi-level content and distortion features at different granularities. We also tailor a training strategy suited for CoDI-IQA to maintain interaction stability. Extensive experiments demonstrate that the proposed method notably outperforms the state-of-the-art methods in terms of prediction accuracy, data efficiency, and generalization ability.

* 19 pages (main text: 14 pages + appendix: 5 pages), 9 figures, 23 tables. In submission

Via

Access Paper or Ask Questions

Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression

Mar 16, 2024

Zhi Cao, Youneng Bao, Fanyang Meng, Chao Li, Wen Tan, Genhong Wang, Yongsheng Liang

Figure 1 for Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression

Figure 2 for Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression

Figure 3 for Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression

Figure 4 for Enhancing Adversarial Training with Prior Knowledge Distillation for Robust Image Compression

Abstract:Deep neural network-based image compression (NIC) has achieved excellent performance, but NIC method models have been shown to be susceptible to backdoor attacks. Adversarial training has been validated in image compression models as a common method to enhance model robustness. However, the improvement effect of adversarial training on model robustness is limited. In this paper, we propose a prior knowledge-guided adversarial training framework for image compression models. Specifically, first, we propose a gradient regularization constraint for training robust teacher models. Subsequently, we design a knowledge distillation based strategy to generate a priori knowledge from the teacher model to the student model for guiding adversarial training. Experimental results show that our method improves the reconstruction quality by about 9dB when the Kodak dataset is elected as the backdoor attack object for psnr attack. Compared with Ma2023, our method has a 5dB higher PSNR output at high bitrate points.

Via

Access Paper or Ask Questions

Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Feb 04, 2024

Mengyuan Liu, Chen Chen, Songtao Wu, Fanyang Meng, Hong Liu

Figure 1 for Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Figure 2 for Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Figure 3 for Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Figure 4 for Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition

Abstract:Recognizing interactive actions, including hand-to-hand interaction and human-to-human interaction, has attracted increasing attention for various applications in the field of video analysis and human-robot interaction. Considering the success of graph convolution in modeling topology-aware features from skeleton data, recent methods commonly operate graph convolution on separate entities and use late fusion for interactive action recognition, which can barely model the mutual semantic relationships between pairwise entities. To this end, we propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution (me-GC) layers. Specifically, me-GC uses a mutual topology excitation module to firstly extract adjacency matrices from individual entities and then adaptively model the mutual constraints between them. Moreover, me-GC extends the above idea and further uses a mutual feature excitation module to extract and merge deep features from pairwise entities. Compared with graph convolution, our proposed me-GC gradually learns mutual information in each layer and each stage of graph convolution operations. Extensive experiments on a challenging hand-to-hand interaction dataset, i.e., the Assembely101 dataset, and two large-scale human-to-human interaction datasets, i.e., NTU60-Interaction and NTU120-Interaction consistently verify the superiority of our proposed method, which outperforms the state-of-the-art GCN-based and Transformer-based methods.

Via

Access Paper or Ask Questions

Explore Human Parsing Modality for Action Recognition

Jan 04, 2024

Jinfu Liu, Runwei Ding, Yuhang Wen, Nan Dai, Fanyang Meng, Shen Zhao, Mengyuan Liu

Figure 1 for Explore Human Parsing Modality for Action Recognition

Figure 2 for Explore Human Parsing Modality for Action Recognition

Abstract:Multimodal-based action recognition methods have achieved high success using pose and RGB modality. However, skeletons sequences lack appearance depiction and RGB images suffer irrelevant noise due to modality limitations. To address this, we introduce human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts, while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeletons and human parsing modalities for action recognition. The first human pose branch feeds robust skeletons in graph convolutional network to model pose features, while the second human parsing branch also leverages depictive parsing feature maps to model parsing festures via convolutional backbones. The two high-level features will be effectively combined through a late fusion strategy for better action recognition. Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms the existing action recognition methods. Our code is available at: https://github.com/liujf69/EPP-Net-Action.

* arXiv admin note: text overlap with arXiv:2307.07977

Via

Access Paper or Ask Questions

Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Nov 23, 2023

Wanying Zhang, Shen Zhao, Fanyang Meng, Songtao Wu, Mengyuan Liu

Figure 1 for Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Figure 2 for Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Figure 3 for Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Figure 4 for Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

Abstract:With potential applications in fields including intelligent surveillance and human-robot interaction, the human motion prediction task has become a hot research topic and also has achieved high success, especially using the recent Graph Convolutional Network (GCN). Current human motion prediction task usually focuses on predicting human motions for atomic actions. Observing that atomic actions can happen at the same time and thus formulating the composite actions, we propose the composite human motion prediction task. To handle this task, we first present a Composite Action Generation (CAG) module to generate synthetic composite actions for training, thus avoiding the laborious work of collecting composite action samples. Moreover, we alleviate the effect of composite actions on demand for a more complicated model by presenting a Dynamic Compositional Graph Convolutional Network (DC-GCN). Extensive experiments on the Human3.6M dataset and our newly collected CHAMP dataset consistently verify the efficiency of our DC-GCN method, which achieves state-of-the-art motion prediction accuracies and meanwhile needs few extra computational costs than traditional GCN-based human motion methods.

* Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 2856-2864

Via

Access Paper or Ask Questions

Integrating Human Parsing and Pose Network for Human Action Recognition

Jul 16, 2023

Runwei Ding, Yuhang Wen, Jinfu Liu, Nan Dai, Fanyang Meng, Mengyuan Liu

Figure 1 for Integrating Human Parsing and Pose Network for Human Action Recognition

Figure 2 for Integrating Human Parsing and Pose Network for Human Action Recognition

Figure 3 for Integrating Human Parsing and Pose Network for Human Action Recognition

Figure 4 for Integrating Human Parsing and Pose Network for Human Action Recognition

Abstract:Human skeletons and RGB sequences are both widely-adopted input modalities for human action recognition. However, skeletons lack appearance features and color data suffer large amount of irrelevant depiction. To address this, we introduce human parsing feature map as a novel modality, since it can selectively retain spatiotemporal features of the body parts, while filtering out noises regarding outfits, backgrounds, etc. We propose an Integrating Human Parsing and Pose Network (IPP-Net) for action recognition, which is the first to leverage both skeletons and human parsing feature maps in dual-branch approach. The human pose branch feeds compact skeletal representations of different modalities in graph convolutional network to model pose features. In human parsing branch, multi-frame body-part parsing features are extracted with human detector and parser, which is later learnt using a convolutional backbone. A late ensemble of two branches is adopted to get final predictions, considering both robust keypoints and rich semantic body-part features. Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed IPP-Net, which outperforms the existing action recognition methods. Our code is publicly available at https://github.com/liujf69/IPP-Net-Parsing .

* CICAI 2023 Camera-ready Version

Via

Access Paper or Ask Questions

Universal Learned Image Compression With Low Computational Cost

Jun 23, 2022

Bowen Li, Yao Xin, Youneng Bao, Fanyang Meng, Yongsheng Liang, Wen Tan

Figure 1 for Universal Learned Image Compression With Low Computational Cost

Figure 2 for Universal Learned Image Compression With Low Computational Cost

Figure 3 for Universal Learned Image Compression With Low Computational Cost

Figure 4 for Universal Learned Image Compression With Low Computational Cost

Abstract:Recently, learned image compression methods have developed rapidly and exhibited excellent rate-distortion performance when compared to traditional standards, such as JPEG, JPEG2000 and BPG. However, the learning-based methods suffer from high computational costs, which is not beneficial for deployment on devices with limited resources. To this end, we propose shift-addition parallel modules (SAPMs), including SAPM-E for the encoder and SAPM-D for the decoder, to largely reduce the energy consumption. To be specific, they can be taken as plug-and-play components to upgrade existing CNN-based architectures, where the shift branch is used to extract large-grained features as compared to small-grained features learned by the addition branch. Furthermore, we thoroughly analyze the probability distribution of latent representations and propose to use Laplace Mixture Likelihoods for more accurate entropy estimation. Experimental results demonstrate that the proposed methods can achieve comparable or even better performance on both PSNR and MS-SSIM metrics to that of the convolutional counterpart with an about 2x energy reduction.

* 5 pages

Via

Access Paper or Ask Questions

Context-aware Visual Tracking with Joint Meta-updating

Apr 04, 2022

Qiuhong Shen, Xin Li, Fanyang Meng, Yongsheng Liang

Figure 1 for Context-aware Visual Tracking with Joint Meta-updating

Figure 2 for Context-aware Visual Tracking with Joint Meta-updating

Figure 3 for Context-aware Visual Tracking with Joint Meta-updating

Figure 4 for Context-aware Visual Tracking with Joint Meta-updating

Abstract:Visual object tracking acts as a pivotal component in various emerging video applications. Despite the numerous developments in visual tracking, existing deep trackers are still likely to fail when tracking against objects with dramatic variation. These deep trackers usually do not perform online update or update single sub-branch of the tracking model, for which they cannot adapt to the appearance variation of objects. Efficient updating methods are therefore crucial for tracking while previous meta-updater optimizes trackers directly over parameter space, which is prone to over-fit even collapse on longer sequences. To address these issues, we propose a context-aware tracking model to optimize the tracker over the representation space, which jointly meta-update both branches by exploiting information along the whole sequence, such that it can avoid the over-fitting problem. First, we note that the embedded features of the localization branch and the box-estimation branch, focusing on the local and global information of the target, are effective complements to each other. Based on this insight, we devise a context-aggregation module to fuse information in historical frames, followed by a context-aware module to learn affinity vectors for both branches of the tracker. Besides, we develop a dedicated meta-learning scheme, on account of fast and stable updating with limited training samples. The proposed tracking method achieves an EAO score of 0.514 on VOT2018 with the speed of 40FPS, demonstrating its capability of improving the accuracy and robustness of the underlying tracker with little speed drop.

* 9 pages, 8 figures

Via

Access Paper or Ask Questions