Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lihuo He

FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution

Jun 13, 2025

Zhaoyang Wang, Jie Li, Wen Lu, Lihuo He, Maoguo Gong, Xinbo Gao

Abstract:State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequate for addressing current video super-resolution (VSR) demands. To overcome these challenges, we propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames. The proposed modular architecture is designed for seamless integration with existing VSR frameworks, ensuring strong adaptability and transferability across diverse applications. Experimental results demonstrate that our method achieves performance on par with, or surpassing, the current SOTA models, while significantly reducing inference time. By addressing key bottlenecks in CVSR, our work offers a practical and efficient pathway for advancing VSR technology. Our code will be publicly available at https://github.com/handsomewzy/FCA2.

* This work has been submitted to the IEEE TMM for possible publication

Via

Access Paper or Ask Questions

EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment

Jun 13, 2025

Zhaoyang Wang, Wen Lu, Jie Li, Lihuo He, Maoguo Gong, Xinbo Gao

Abstract:Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs-resized full-frame images and patch-based fragments-to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.

* This work has been submitted to the IEEE TCSVT for possible publication

Via

Access Paper or Ask Questions

A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality Assessment

Jan 21, 2025

Bo Hu, Wei Wang, Chunyi Li, Lihuo He, Leida Li, Xinbo Gao

Abstract:Wide-angle video is favored for its wide viewing angle and ability to capture a large area of scenery, making it an ideal choice for sports and adventure recording. However, wide-angle video is prone to deformation, exposure and other distortions, resulting in poor video quality and affecting the perception and experience, which may seriously hinder its application in fields such as competitive sports. Up to now, few explorations focus on the quality assessment issue of wide-angle video. This deficiency primarily stems from the absence of a specialized dataset for wide-angle videos. To bridge this gap, we construct the first Multi-annotated and multi-modal Wide-angle Video quality assessment (MWV) dataset. Then, the performances of state-of-the-art video quality methods on the MWV dataset are investigated by inter-dataset testing and intra-dataset testing. Experimental results show that these methods impose significant limitations on their applicability.

Via

Access Paper or Ask Questions

CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Dec 13, 2024

Kaifan Zhang, Lihuo He, Xin Jiang, Wen Lu, Di Wang, Xinbo Gao

Figure 1 for CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Figure 2 for CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Figure 3 for CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Figure 4 for CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Abstract:Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable ``beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, we propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Expert Encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space, followed by using a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively. Code: https://github.com/XiaoZhangYES/CognitionCapturer.

Via

Access Paper or Ask Questions

AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Nov 25, 2024

Jili Xia, Lihuo He, Fei Gao, Kaifan Zhang, Leida Li, Xinbo Gao

Figure 1 for AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Figure 2 for AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Figure 3 for AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Figure 4 for AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Abstract:Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.

Via

Access Paper or Ask Questions

Feature Erasing and Diffusion Network for Occluded Person Re-Identification

Dec 16, 2021

Zhikang Wang, Feng Zhu, Shixiang Tang, Rui Zhao, Lihuo He, Jiangning Song

Figure 1 for Feature Erasing and Diffusion Network for Occluded Person Re-Identification

Figure 2 for Feature Erasing and Diffusion Network for Occluded Person Re-Identification

Figure 3 for Feature Erasing and Diffusion Network for Occluded Person Re-Identification

Figure 4 for Feature Erasing and Diffusion Network for Occluded Person Re-Identification

Abstract:Occluded person re-identification (ReID) aims at matching occluded person images to holistic ones across different camera views. Target Pedestrians (TP) are usually disturbed by Non-Pedestrian Occlusions (NPO) and NonTarget Pedestrians (NTP). Previous methods mainly focus on increasing model's robustness against NPO while ignoring feature contamination from NTP. In this paper, we propose a novel Feature Erasing and Diffusion Network (FED) to simultaneously handle NPO and NTP. Specifically, NPO features are eliminated by our proposed Occlusion Erasing Module (OEM), aided by the NPO augmentation strategy which simulates NPO on holistic pedestrian images and generates precise occlusion masks. Subsequently, we Subsequently, we diffuse the pedestrian representations with other memorized features to synthesize NTP characteristics in the feature space which is achieved by a novel Feature Diffusion Module (FDM) through a learnable cross attention mechanism. With the guidance of the occlusion scores from OEM, the feature diffusion process is mainly conducted on visible body parts, which guarantees the quality of the synthesized NTP characteristics. By jointly optimizing OEM and FDM in our proposed FED network, we can greatly improve the model's perception ability towards TP and alleviate the influence of NPO and NTP. Furthermore, the proposed FDM only works as an auxiliary module for training and will be discarded in the inference phase, thus introducing little inference computational overhead. Experiments on occluded and holistic person ReID benchmarks demonstrate the superiority of FED over state-of-the-arts, where FED achieves 86.3% Rank-1 accuracy on Occluded-REID, surpassing others by at least 4.7%.

* 10 pages, 5 figures

Via

Access Paper or Ask Questions

Robust Person Re-Identification through Contextual Mutual Boosting

Sep 16, 2020

Zhikang Wang, Lihuo He, Xinbo Gao, Jane Shen

Figure 1 for Robust Person Re-Identification through Contextual Mutual Boosting

Figure 2 for Robust Person Re-Identification through Contextual Mutual Boosting

Figure 3 for Robust Person Re-Identification through Contextual Mutual Boosting

Figure 4 for Robust Person Re-Identification through Contextual Mutual Boosting

Abstract:Person Re-Identification (Re-ID) has witnessed great advance, driven by the development of deep learning. However, modern person Re-ID is still challenged by background clutter, occlusion and large posture variation which are common in practice. Previous methods tackle these challenges by localizing pedestrians through external cues (e.g., pose estimation, human parsing) or attention mechanism, suffering from high computation cost and increased model complexity. In this paper, we propose the Contextual Mutual Boosting Network (CMBN). It localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference. Firstly, we construct two branches with a shared convolutional frontend to learn the foreground and background features respectively. By enabling interaction between these two branches, they boost the accuracy of the spatial localization mutually. Secondly, starting from a statistical perspective, we propose the Mask Generator that exploits the activation distribution of the transformation matrix for generating the static channel mask to the representations. The mask recalibrates the features to amplify the valuable characteristics and diminish the noise. Finally, we propose the Contextual-Detachment Strategy to optimize the two branches jointly and independently, which further enhances the localization precision. Experiments on the benchmarks demonstrate the superiority of the architecture compared the state-of-the-art.

* 8 pages, 4 figures, conference

Via

Access Paper or Ask Questions

A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

Dec 19, 2018

Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He

Figure 1 for A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

Figure 2 for A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

Figure 3 for A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

Figure 4 for A Gated Peripheral-Foveal Convolutional Neural Network for Unified Image Aesthetic Prediction

Abstract:Learning fine-grained details is a key issue in image aesthetic assessment. Most of the previous methods extract the fine-grained details via random cropping strategy, which may undermine the integrity of semantic information. Extensive studies show that humans perceive fine-grained details with a mixture of foveal vision and peripheral vision. Fovea has the highest possible visual acuity and is responsible for seeing the details. The peripheral vision is used for perceiving the broad spatial scene and selecting the attended regions for the fovea. Inspired by these observations, we propose a Gated Peripheral-Foveal Convolutional Neural Network (GPF-CNN). It is a dedicated double-subnet neural network, i.e. a peripheral subnet and a foveal subnet. The former aims to mimic the functions of peripheral vision to encode the holistic information and provide the attended regions. The latter aims to extract fine-grained features on these key regions. Considering that the peripheral vision and foveal vision play different roles in processing different visual stimuli, we further employ a gated information fusion (GIF) network to weight their contributions. The weights are determined through the fully connected layers followed by a sigmoid function. We conduct comprehensive experiments on the standard AVA and Photo.net datasets for unified aesthetic prediction tasks: (i) aesthetic quality classification; (ii) aesthetic score regression; and (iii) aesthetic score distribution prediction. The experimental results demonstrate the effectiveness of the proposed method.

* 11 pages

Via

Access Paper or Ask Questions