Abstract: Real-world image super-resolution (Real-ISR) aims to restore high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors for reconstructing credible and intricate details. However, T2I generation focuses on semantic consistency, whereas Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. Specifically, beyond coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities through our Hybrid Prompt Adapter (HPA) for semantic guidance. Secondly, we introduce Time-aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and the consistency requirements of Real-ISR. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-Embedding strategy employs the pretrained Real-ESRGAN model to refine the diffusion start point. This accelerates inference to 10 steps while preserving sampling quality, in a training-free manner. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.
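Below is a minimal PyTorch-style sketch of the random LQ/HQ latent mixing described for TALA, written from the abstract alone. The linear time-aware mixing probability, the function name `tala_training_input`, and the diffusers-style `scheduler.add_noise()` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch: build the noisy training input from either the HQ latent
# (standard denoising target) or the LQ latent (to mimic imperfect, accumulated
# latents seen at inference). The mixing schedule below is an assumption.
import torch

def tala_training_input(z_hq, z_lq, t, scheduler, num_train_timesteps=1000):
    """z_hq, z_lq: clean HQ / LQ latents, shape (B, C, H, W).
    t: integer timesteps, shape (B,).
    scheduler: any diffusers-style scheduler exposing add_noise()."""
    noise = torch.randn_like(z_hq)
    # Assumed time-aware mixing probability: favor the LQ latent at small
    # timesteps, where accumulated latent errors matter most at inference.
    p_lq = 1.0 - t.float() / num_train_timesteps            # (B,)
    use_lq = torch.rand_like(p_lq) < p_lq                    # (B,) boolean
    z0 = torch.where(use_lq.view(-1, 1, 1, 1), z_lq, z_hq)   # per-sample mix
    return scheduler.add_noise(z0, noise, t), noise
```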
Abstract: Due to the limitations of capture devices and scenarios, egocentric videos frequently suffer from low visual quality, mainly caused by heavy compression and severe motion blur. With the increasing application of egocentric videos, there is an urgent need to enhance their quality through super-resolution. However, existing Video Super-Resolution (VSR) works focus on third-person-view videos and are ill-suited to handling the blurring artifacts caused by rapid ego-motion and object motion in egocentric videos. To this end, we propose EgoVSR, a VSR framework specifically designed for egocentric videos. We explicitly tackle motion blur in egocentric videos using a Dual Branch Deblur Network (DB$^2$Net) within the VSR framework. Meanwhile, a blurring mask is introduced to guide DB$^2$Net learning and can also be used to localize blurred areas in video frames. We further design a MaskNet to predict the mask, along with a mask loss to optimize mask estimation. Additionally, an online motion blur synthesis model for common VSR training data is proposed to simulate the motion blur found in egocentric videos. To validate the effectiveness of our proposed method, we introduce an EgoVSR dataset containing a large number of fast-motion egocentric video sequences. Extensive experiments demonstrate that our EgoVSR model can efficiently super-resolve low-quality egocentric videos and outperforms strong baselines. Our code, pre-trained models, and data will be released.
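As a rough illustration of what online motion-blur synthesis for VSR training clips could look like, the sketch below fakes blur by temporally averaging neighboring sharp frames and derives a soft blur mask from the sharp-versus-blurred difference. The window size, threshold, and function name `synthesize_blur` are assumptions for illustration only, not the paper's actual synthesis model.

```python
# Illustrative sketch: synthesize motion blur on sharp HQ training frames and a
# per-frame soft mask marking the regions most affected by the synthetic blur.
import torch

def synthesize_blur(frames, window=3, mask_thresh=0.02):
    """frames: (T, C, H, W) sharp frames in [0, 1].
    Returns blurred frames (T, C, H, W) and a soft blur mask (T, 1, H, W)."""
    T = frames.shape[0]
    pad = window // 2
    blurred = torch.empty_like(frames)
    for i in range(T):
        lo, hi = max(0, i - pad), min(T, i + pad + 1)
        blurred[i] = frames[lo:hi].mean(dim=0)   # temporal average approximates motion blur
    # Blur mask: large sharp-vs-blurred differences indicate fast-moving regions.
    diff = (frames - blurred).abs().mean(dim=1, keepdim=True)   # (T, 1, H, W)
    mask = (diff / mask_thresh).clamp(0, 1)
    return blurred, mask
```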