Abstract:Window-based transformers have demonstrated outstanding performance in super-resolution tasks due to their adaptive modeling capabilities through local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of the Transformers is derived from their adaptive spatial aggregation and advanced structural design, while their high latency results from the computational costs and memory layout transformations associated with the local SA. To simulate this aggregation approach, we propose an effective convolution-based linear focal separable attention (FSA), allowing for long-range dynamic modeling with linear complexity. Additionally, we introduce an effective dual-branch structure combined with an ultra-lightweight information exchange module (IEM) to enhance the aggregation of information by the Token Mixer. Finally, with respect to the structure, we modify the existing spatial-gate-based feedforward neural networks by incorporating a self-gate mechanism to preserve high-dimensional channel information, enabling the modeling of more complex relationships. With these advancements, we construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet). Extensive experiments demonstrate that LAMNet achieves better performance than existing SA-based Transformer methods while maintaining the computational efficiency of convolutional neural networks, which can achieve a \(3\times\) speedup of inference time. The code will be publicly available at: https://github.com/zononhzy/LAMNet.
Abstract:Remote-sensing (RS) image compression at extremely low bitrates has always been a challenging task in practical scenarios like edge device storage and narrow bandwidth transmission. Generative models including VAEs and GANs have been explored to compress RS images into extremely low-bitrate streams. However, these generative models struggle to reconstruct visually plausible images due to the highly ill-posed nature of extremely low-bitrate image compression. To this end, we propose an image compression framework that utilizes a pre-trained diffusion model with powerful natural image priors to achieve high-realism reconstructions. However, diffusion models tend to hallucinate small structures and textures due to the significant information loss at limited bitrates. Thus, we introduce vector maps as semantic and structural guidance and propose a novel image compression approach named Map-Assisted Generative Compression (MAGC). MAGC employs a two-stage pipeline to compress and decompress RS images at extremely low bitrates. The first stage maps an image into a latent representation, which is then further compressed in a VAE architecture to save bitrates and serves as implicit guidance in the subsequent diffusion process. The second stage conducts a conditional diffusion model to generate a visually pleasing and semantically accurate result using implicit guidance and explicit semantic guidance. Quantitative and qualitative comparisons show that our method outperforms standard codecs and other learning-based methods in terms of perceptual quality and semantic accuracy. The dataset and code will be publicly available at https://github.com/WHUyyx/MAGC.
Abstract:Image rescaling aims to learn the optimal downscaled low-resolution (LR) image that can be accurately reconstructed to its original high-resolution (HR) counterpart. This process is crucial for efficient image processing and storage, especially in the era of ultra-high definition media. However, extreme downscaling factors pose significant challenges due to the highly ill-posed nature of the inverse upscaling process, causing existing methods to struggle in generating semantically plausible structures and perceptually rich textures. In this work, we propose a novel framework called Latent Space Based Image Rescaling (LSBIR) for extreme image rescaling tasks. LSBIR effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model to generate realistic HR images. The rescaling is performed in the latent space of a pre-trained image encoder and decoder, which offers better perceptual reconstruction quality due to its stronger sparsity and richer semantics. LSBIR adopts a two-stage training strategy. In the first stage, a pseudo-invertible encoder-decoder models the bidirectional mapping between the latent features of the HR image and the target-sized LR image. In the second stage, the reconstructed features from the first stage are refined by a pre-trained diffusion model to generate more faithful and visually pleasing details. Extensive experiments demonstrate the superiority of LSBIR over previous methods in both quantitative and qualitative evaluations. The code will be available at: https://github.com/wwangcece/LSBIR.
Abstract:Remote sensing images captured by different platforms exhibit significant disparities in spatial resolution. Large scale factor super-resolution (SR) algorithms are vital for maximizing the utilization of low-resolution (LR) satellite data captured from orbit. However, existing methods confront challenges in recovering SR images with clear textures and correct ground objects. We introduce a novel framework, the Semantic Guided Diffusion Model (SGDM), designed for large scale factor remote sensing image super-resolution. The framework exploits a pre-trained generative model as a prior to generate perceptually plausible SR images. We further enhance the reconstruction by incorporating vector maps, which carry structural and semantic cues. Moreover, pixel-level inconsistencies in paired remote sensing images, stemming from sensor-specific imaging characteristics, may hinder the convergence of the model and diversity in generated results. To address this problem, we propose to extract the sensor-specific imaging characteristics and model the distribution of them, allowing diverse SR images generation based on imaging characteristics provided by reference images or sampled from the imaging characteristic probability distributions. To validate and evaluate our approach, we create the Cross-Modal Super-Resolution Dataset (CMSRD). Qualitative and quantitative experiments on CMSRD showcase the superiority and broad applicability of our method. Experimental results on downstream vision tasks also demonstrate the utilitarian of the generated SR images. The dataset and code will be publicly available at https://github.com/wwangcece/SGDM
Abstract:Learning based single image super-resolution (SISR) for real-world images has been an active research topic yet a challenging task, due to the lack of paired low-resolution (LR) and high-resolution (HR) training images. Most of the existing unsupervised real-world SISR methods adopt a two-stage training strategy by synthesizing realistic LR images from their HR counterparts first, then training the super-resolution (SR) models in a supervised manner. However, the training of image degradation and SR models in this strategy are separate, ignoring the inherent mutual dependency between downscaling and its inverse upscaling process. Additionally, the ill-posed nature of image degradation is not fully considered. In this paper, we propose an image downscaling and SR model dubbed as SDFlow, which simultaneously learns a bidirectional many-to-many mapping between real-world LR and HR images unsupervisedly. The main idea of SDFlow is to decouple image content and degradation information in the latent space, where content information distribution of LR and HR images is matched in a common latent space. Degradation information of the LR images and the high-frequency information of the HR images are fitted to an easy-to-sample conditional distribution. Experimental results on real-world image SR datasets indicate that SDFlow can generate diverse realistic LR and SR images both quantitatively and qualitatively.
Abstract:Deep convolutional neural network based image super-resolution (SR) models have shown superior performance in recovering the underlying high resolution (HR) images from low resolution (LR) images obtained from the predefined downscaling methods. In this paper we propose a learned image downscaling method based on content adaptive resampler (CAR) with consideration on the upscaling process. The proposed resampler network generates content adaptive image resampling kernels that are applied to the original HR input to generate pixels on the downscaled image. Moreover, a differentiable upscaling (SR) module is employed to upscale the LR result into its underlying HR counterpart. By back-propagating the reconstruction error down to the original HR input across the entire framework to adjust model parameters, the proposed framework achieves a new state-of-the-art SR performance through upscaling guided image resamplers which adaptively preserve detailed information that is essential to the upscaling. Experimental results indicate that the quality of the generated LR image is comparable to that of the traditional interpolation based method, but the significant SR performance gain is achieved by deep SR models trained jointly with the CAR model. The code is publicly available on: URL https://github.com/sunwj/CAR.