Abstract:Removing blur caused by moving objects is challenging, as the moving objects are usually significantly blurry while the static background remains clear. Existing methods that rely on local blur detection often suffer from inaccuracies and cannot generate satisfactory results when focusing solely on blurred regions. To overcome these problems, we first design a context-based local blur detection module that incorporates additional contextual information to improve the identification of blurry regions. Considering that modern smartphones are equipped with cameras capable of providing short-exposure images, we develop a blur-aware guided image restoration method that utilizes sharp structural details from short-exposure images, facilitating accurate reconstruction of heavily blurred regions. Furthermore, to produce realistic and visually pleasing restorations, we develop a short-exposure guided diffusion model that explores useful features from short-exposure images and blurred regions to better constrain the diffusion process. Finally, we formulate the above components into a simple yet effective network, named ExpRDiff. Experimental results show that ExpRDiff performs favorably against state-of-the-art methods.
Abstract:Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale, high-quality real-world training data is urgently needed to facilitate the emergence of foundation models for image restoration. To advance this field, we contribute a million-scale dataset with two notable advantages over existing training data: a larger scale of real-world samples and a higher diversity of degradation types. By adjusting internal camera settings and external imaging conditions, we capture aligned image pairs over multiple rounds using our well-designed data acquisition system and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models to produce the final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
Abstract:Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs that align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, enabling the encoder to extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
Abstract:Most existing super-resolution methods and datasets have been developed to improve image quality in well-lit conditions. However, these methods do not work well in real-world low-light conditions, as images captured in such conditions lose most of their important information and contain significant unknown noise. To solve this problem, we propose the SRRIIE dataset together with an efficient method based on conditional diffusion probabilistic models. The proposed dataset contains 4800 paired low-high quality images. To ensure that the dataset is able to model real-world image degradation in low-illumination environments, we capture images using an ILDC camera and an optical zoom lens with exposure levels ranging from -6 EV to 0 EV and ISO levels ranging from 50 to 12800. We comprehensively evaluate with various reconstruction and perceptual metrics and demonstrate the practicability of the SRRIIE dataset for deep learning-based methods. We show that most existing methods are less effective in preserving the structures and sharpness of images restored from inputs with complicated noise. To overcome this problem, we revise the condition for Raw sensor data and propose a novel time-melding condition for the diffusion probabilistic model. Comprehensive quantitative and qualitative experimental results on real-world benchmark datasets demonstrate the feasibility and effectiveness of the proposed conditional diffusion probabilistic model on Raw sensor data. Code and dataset will be available at https://github.com/Yaofang-Liu/Super-Resolving
Abstract:Existing video super-resolution (VSR) methods generally adopt a recurrent propagation network to extract spatio-temporal information from the entire video sequences, exhibiting impressive performance. However, the key components in recurrent-based VSR networks significantly impact model efficiency, e.g., the alignment module occupies a substantial portion of model parameters, while the bidirectional propagation mechanism significantly amplifies the inference time. Consequently, developing a compact and efficient VSR method that can be deployed on resource-constrained devices, e.g., smartphones, remains challenging. To this end, we propose a cascaded temporal updating network (CTUN) for efficient VSR. We first develop an implicit cascaded alignment module to explore spatio-temporal correspondences from adjacent frames. Moreover, we propose a unidirectional propagation updating network to efficiently explore long-range temporal information, which is crucial for high-quality video reconstruction. Specifically, we develop a simple yet effective hidden updater that can leverage future information to update hidden features during forward propagation, significantly reducing inference time while maintaining performance. Finally, we formulate all of these components into an end-to-end trainable VSR network. Extensive experimental results show that our CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods. Notably, compared with BasicVSR, our method obtains better results while employing only about 30% of the parameters and running time. The source code and pre-trained models will be available at https://github.com/House-Leo/CTUN.
Abstract:Ultra-High-Definition (UHD) image restoration has attracted considerable attention due to its practical demand. In this paper, we construct UHD snow and rain benchmarks, named UHD-Snow and UHD-Rain, to remedy the deficiency in this field. UHD-Snow and UHD-Rain are built by simulating the physical processes of snow and rain, and each benchmark contains 3200 degraded/clear image pairs at 4K resolution. Furthermore, we propose an effective UHD image restoration solution that incorporates gradient and normal priors into the model design, motivated by the spatial and detail information these priors provide. Specifically, our method contains two branches: (a) a feature fusion and reconstruction branch in high-resolution space and (b) a prior feature interaction branch in low-resolution space. The former learns high-resolution features and fuses prior-guided low-resolution features to reconstruct clear images, while the latter utilizes normal and gradient priors to mine useful spatial and detail features that better guide high-resolution recovery. To better utilize these priors, we introduce single prior feature interaction and dual prior feature interaction, where the former fuses the normal and gradient priors separately with high-resolution features to enhance the prior features, while the latter calculates the similarity between the enhanced prior features and further exploits dual guided filtering to boost the feature interaction of the dual priors. We conduct experiments on both new and existing public datasets and demonstrate the state-of-the-art performance of our method on UHD image low-light enhancement, UHD image desnowing, and UHD image deraining. The source codes and benchmarks are available at https://github.com/wlydlut/UHDDIP.
Abstract:The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the image formation processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held at the CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.
Abstract:This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features, fuses low-high features, and reconstructs the residual images, while the latter learns more representative features from the high-resolution ones to facilitate better restoration. To improve feature representation in low-resolution space, we propose to build a feature transformation from the high-resolution space to the low-resolution one. To that end, we propose two new modules: the Dual-path Correlation Matching Transformation module (DualCMT) and the Adaptive Channel Modulator (ACM). The DualCMT selects the top C/r correlation channels (where r ≥ 1 controls the squeezing level) from the max-pooled/mean-pooled high-resolution features to replace the low-resolution ones in Transformers, which effectively squeezes out useless content and improves the feature representation in low-resolution space to facilitate better recovery. The ACM adaptively modulates multi-level high-resolution features so that they provide more useful features to the low-resolution space for better learning. Experimental results show that our UHDformer reduces model size by about 97% compared with most state-of-the-art methods while significantly improving performance under different training sets on three UHD image restoration tasks: low-light image enhancement, image dehazing, and image deblurring. The source codes will be made available at https://github.com/supersupercong/UHDformer.
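The channel-selection idea in the UHDformer abstract (keep the top C/r channels ranked by correlation) can be illustrated with a minimal sketch. It is not the authors' DualCMT: the function names `cosine` and `top_channels`, the use of cosine similarity as the correlation score, and the pairing of each pooled high-resolution channel with the low-resolution channel at the same index are all assumptions made only for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two flattened channels (assumed score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_channels(pooled_high: List[List[float]],
                 low: List[List[float]],
                 r: int) -> List[int]:
    """Return the indices of the top C // r channels by correlation score.

    pooled_high: C pooled high-resolution channels, each flattened.
    low:         C low-resolution channels of the same flattened size.
    r:           squeezing level, r >= 1 (r = 1 keeps every channel).
    """
    if r < 1:
        raise ValueError("r must be >= 1")
    c = len(pooled_high)
    k = max(1, c // r)
    scores = [cosine(h, l) for h, l in zip(pooled_high, low)]
    ranked = sorted(range(c), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: 4 channels with 2 values each (hypothetical numbers).
pooled_high = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 2.0]]
low = [[1.0, 0.0], [1.0, -1.0], [0.0, 1.0], [2.0, 1.0]]
selected = top_channels(pooled_high, low, r=2)
```

With r = 2, half of the channels are kept; a larger r squeezes the representation further, which is the role the abstract assigns to r.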
Abstract:Although significant progress has been made in image deraining, existing approaches are mostly developed and evaluated on low-resolution images. The effectiveness of these methods on high-resolution images is still unknown, especially for ultra-high-definition (UHD) images, which are increasingly common given the continuous advancement of imaging devices. In this paper, we focus on the task of UHD image deraining and contribute the first large-scale UHD image deraining dataset, 4K-Rain13k, which contains 13,000 image pairs at 4K resolution. Based on this dataset, we conduct a benchmark study of existing methods for processing UHD images. Furthermore, we develop an effective and efficient vision MLP-based architecture (UDR-Mixer) to better solve this task. Specifically, our method contains two building components: a spatial feature rearrangement layer that captures long-range information of UHD images, and a frequency feature modulation layer that facilitates high-quality UHD image reconstruction. Extensive experimental results demonstrate that our method performs favorably against state-of-the-art approaches while maintaining a lower model complexity. The code and dataset will be available at https://github.com/cschenxiang/UDR-Mixer.
Abstract:Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods. Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image. This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution (SR) method, designed to enhance the efficiency of the inference process. The proposed method is motivated by the observation that not all the image patches within an image need the same sampling steps for reconstructing high-resolution images. Based on this observation, we develop a Patch-adaptive Group Sampling (PGS) to divide feature patches into different groups according to the patch-level reconstruction difficulty and dynamically assign an appropriate sampling configuration for each group so that the inference speed can be better accelerated. In addition, to improve the denoising ability at each step of the sampling, we develop a texture prompt to guide the estimations of the diffusion model by retrieving high-quality texture priors from a patch-independent reference texture memory. Experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations with fast inference speed. Our code and model are available at https://github.com/yongliuy/PatchScaler.
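The grouping step described for Patch-adaptive Group Sampling can be sketched as follows. This is a minimal illustration under stated assumptions, not the PatchScaler implementation: the three group names, the thresholds 0.3 and 0.7, the per-group step counts in `GROUP_STEPS`, and the difficulty scores are all hypothetical, and how the difficulty of a patch is actually estimated is left outside the sketch.

```python
from typing import Dict, List

# Hypothetical sampling steps per group; harder patches get more steps.
GROUP_STEPS: Dict[str, int] = {"simple": 1, "medium": 5, "hard": 15}


def assign_group(difficulty: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a patch-level difficulty score in [0, 1] to a group name."""
    if difficulty < low:
        return "simple"
    if difficulty < high:
        return "medium"
    return "hard"


def group_patches(difficulties: List[float]) -> Dict[str, List[int]]:
    """Collect patch indices per group so each group can be sampled together."""
    groups: Dict[str, List[int]] = {name: [] for name in GROUP_STEPS}
    for idx, score in enumerate(difficulties):
        groups[assign_group(score)].append(idx)
    return groups


def total_steps(groups: Dict[str, List[int]]) -> int:
    """Total per-patch sampling steps under the group-specific schedule."""
    return sum(GROUP_STEPS[name] * len(idxs) for name, idxs in groups.items())


# Toy difficulty scores for six patches (hypothetical values).
difficulties = [0.1, 0.9, 0.5, 0.2, 0.75, 0.4]
groups = group_patches(difficulties)
adaptive_cost = total_steps(groups)
uniform_cost = GROUP_STEPS["hard"] * len(difficulties)
```

In this toy setting the adaptive schedule spends 42 patch-steps where a uniform schedule at the hardest setting would spend 90, which illustrates why assigning fewer steps to easy patches can speed up inference.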