Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

A. Murat Tekalp

FG-DFPN: Flow Guided Deformable Frame Prediction Network

Mar 14, 2025

M. Akın Yılmaz, Ahmet Bilican, A. Murat Tekalp

Abstract:Video frame prediction remains a fundamental challenge in computer vision with direct implications for autonomous systems, video compression, and media synthesis. We present FG-DFPN, a novel architecture that harnesses the synergy between optical flow estimation and deformable convolutions to model complex spatio-temporal dynamics. By guiding deformable sampling with motion cues, our approach addresses the limitations of fixed-kernel networks when handling diverse motion patterns. The multi-scale design enables FG-DFPN to simultaneously capture global scene transformations and local object movements with remarkable precision. Our experiments demonstrate that FG-DFPN achieves state-of-the-art performance on eight diverse MPEG test sequences, outperforming existing methods by 1dB PSNR while maintaining competitive inference speeds. The integration of motion cues with adaptive geometric transformations makes FG-DFPN a promising solution for next-generation video processing systems that require high-fidelity temporal predictions. The model and instructions to reproduce our results will be released at: https://github.com/KUIS-AI-Tekalp-Research Group/frame-prediction

* Submitted to 33th European Signal Processing Conference (EUSIPCO) 2025

Via

Access Paper or Ask Questions

On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs

Sep 13, 2024

M. Akin Yilmaz, Onur Keleş, A. Murat Tekalp

Figure 1 for On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs

Figure 2 for On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs

Figure 3 for On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs

Figure 4 for On the Computation of BD-Rate over a Set of Videos for Fair Assessment of Performance of Learned Video Codecs

Abstract:The Bj{\o}ntegaard Delta (BD) measure is widely employed to evaluate and quantify the variations in the rate-distortion(RD) performance across different codecs. Many researchers report the average BD value over multiple videos within a dataset for different codecs. We claim that the current practice in the learned video compression community of computing the average BD value over a dataset based on the average RD curve of multiple videos can lead to misleading conclusions. We show both by analysis of a simplistic case of linear RD curves and experimental results with two recent learned video codecs that averaging RD curves can lead to a single video to disproportionately influence the average BD value especially when the operating bitrate range of different codecs do not exactly match. Instead, we advocate for calculating the BD measure per-video basis, as commonly done by the traditional video compression community, followed by averaging the individual BD values over videos, to provide a fair comparison of learned video codecs. Our experimental results demonstrate that the comparison of two recent learned video codecs is affected by how we evaluate the average BD measure.

* Submitted to IEEE ICASSP 2025

Via

Access Paper or Ask Questions

A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Apr 19, 2024

Ronglei Ji, A. Murat Tekalp

Figure 1 for A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Figure 2 for A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Figure 3 for A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Figure 4 for A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Abstract:Despite the fact real-world video deinterlacing and demosaicing are well-suited to supervised learning from synthetically degraded data because the degradation models are known and fixed, learned video deinterlacing and demosaicing have received much less attention compared to denoising and super-resolution tasks. We propose a new multi-picture architecture for video deinterlacing or demosaicing by aligning multiple supporting pictures with missing data to a reference picture to be reconstructed, benefiting from both local and global spatio-temporal correlations in the feature space using modified deformable convolution blocks and a novel residual efficient top-$k$ self-attention (kSA) block, respectively. Separate reconstruction blocks are used to estimate different types of missing data. Our extensive experimental results, on synthetic or real-world datasets, demonstrate that the proposed novel architecture provides superior results that significantly exceed the state-of-the-art for both tasks in terms of PSNR, SSIM, and perceptual quality. Ablation studies are provided to justify and show the benefit of each novel modification made to the deformable convolution and residual efficient kSA blocks. Code is available: https://github.com/KUIS-AI-Tekalp-Research-Group/Video-Deinterlacing.

* 13 pages, 6 figures, accepted to IMAVIS

Via

Access Paper or Ask Questions

Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution

Apr 17, 2024

Cansu Korkmaz, A. Murat Tekalp

Abstract:Transformer-based models have achieved remarkable results in low-level vision tasks including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows encounter challenges in acquiring global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, have been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) We introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid transformer architecture in order to further enhance its receptive field. ii) We employ wavelet losses to train Transformer models to improve quantitative and subjective performance. While wavelet losses have been explored previously, showing their power in training Transformer-based SR models is novel. Our experimental results demonstrate that the proposed model provides state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.

* total of 10 pages including references, 5 tables and 5 figures, accepted for NTIRE 2024 Single Image Super Resolution (x4) challenge

Via

Access Paper or Ask Questions

NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Apr 15, 2024

Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong(+78 more)

$Figure 1 for NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results$

$Figure 2 for NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results$

$Figure 3 for NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results$

$Figure 4 for NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results$

Abstract:This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.

* NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

Via

Access Paper or Ask Questions

PAON: A New Neuron Model using Padé Approximants

Mar 18, 2024

Onur Keleş, A. Murat Tekalp

Abstract:Convolutional neural networks (CNN) are built upon the classical McCulloch-Pitts neuron model, which is essentially a linear model, where the nonlinearity is provided by a separate activation function. Several researchers have proposed enhanced neuron models, including quadratic neurons, generalized operational neurons, generative neurons, and super neurons, with stronger nonlinearity than that provided by the pointwise activation function. There has also been a proposal to use Pade approximation as a generalized activation function. In this paper, we introduce a brand new neuron model called Pade neurons (Paons), inspired by the Pade approximants, which is the best mathematical approximation of a transcendental function as a ratio of polynomials with different orders. We show that Paons are a super set of all other proposed neuron models. Hence, the basic neuron in any known CNN model can be replaced by Paons. In this paper, we extend the well-known ResNet to PadeNet (built by Paons) to demonstrate the concept. Our experiments on the single-image super-resolution task show that PadeNets can obtain better results than competing architectures.

* Submitted to IEEE ICIP 2024

Via

Access Paper or Ask Questions

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

Feb 29, 2024

Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan

Abstract:Super-resolution (SR) is an ill-posed inverse problem, where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strike a balance between fidelity and perceptual quality. Unfortunately, all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts, this is a very challenging problem and a satisfactory solution is yet to be found. This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before, they have not been used in the context of the SR task. More specifically, we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet subbands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves better perception-distortion trade-off according to multiple objective measures and visual evaluations.

* Accepted for IEEE CVPR 2024, total of 11 pages, 3 pages for references, 7 figures and 2 tables

Via

Access Paper or Ask Questions

Saliency-aware End-to-end Learned Variable-Bitrate 360-degree Image Compression

Feb 14, 2024

Oguzhan Gungordu, A. Murat Tekalp

Abstract:Effective compression of 360$^\circ$ images, also referred to as omnidirectional images (ODIs), is of high interest for various virtual reality (VR) and related applications. 2D image compression methods ignore the equator-biased nature of ODIs and fail to address oversampling near the poles, leading to inefficient compression when applied to ODI. We present a new learned saliency-aware 360$^\circ$ image compression architecture that prioritizes bit allocation to more significant regions, considering the unique properties of ODIs. By assigning fewer bits to less important regions, significant data size reduction can be achieved while maintaining high visual quality in the significant regions. To the best of our knowledge, this is the first study that proposes an end-to-end variable-rate model to compress 360$^\circ$ images leveraging saliency information. The results show significant bit-rate savings over the state-of-the-art learned and traditional ODI compression methods at similar perceptual visual quality.

* 7 pages with double column, 1 and a half for references, 6 figures and 4 tables, submitted to IEEE ICIP 2024

Via

Access Paper or Ask Questions

Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Feb 13, 2024

M. Akin Yilmaz, O. Ugur Ulas, Ahmet Bilican, A. Murat Tekalp

Figure 1 for Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Figure 2 for Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Figure 3 for Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Figure 4 for Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Abstract:While the performance of recent learned intra and sequential video compression models exceed that of respective traditional codecs, the performance of learned B-frame compression models generally lag behind traditional B-frame coding. The performance gap is bigger for complex scenes with large motions. This is related to the fact that the distance between the past and future references vary in hierarchical B-frame compression depending on the level of hierarchy, which causes motion range to vary. The inability of a single B-frame compression model to adapt to various motion ranges causes loss of performance. As a remedy, we propose controlling the motion range for flow prediction during inference (to approximately match the range of motions in the training data) by downsampling video frames adaptively according to amount of motion and level of hierarchy in order to compress all B-frames using a single flexible-rate model. We present state-of-the-art BD rate results to demonstrate the superiority of our proposed single-model motion-adaptive inference approach to all existing learned B-frame compression models.

* 7 pages, submitted to IEEE ICIP 2024

Via

Access Paper or Ask Questions

Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Feb 12, 2024

Cansu Korkmaz, Ege Cirakman, A. Murat Tekalp, Zafer Dogan

Figure 1 for Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Figure 2 for Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Figure 3 for Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Figure 4 for Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Abstract:Super-resolution (SR) is an ill-posed inverse problem with a large set of feasible solutions that are consistent with a given low-resolution image. Various deterministic algorithms aim to find a single solution that balances fidelity and perceptual quality; however, this trade-off often causes visual artifacts that bring ambiguity in information-centric applications. On the other hand, diffusion models (DMs) excel in generating a diverse set of feasible SR images that span the solution space. The challenge is then how to determine the most likely solution among this set in a trustworthy manner. We observe that quantitative measures, such as PSNR, LPIPS, DISTS, are not reliable indicators to resolve ambiguous cases. To this effect, we propose employing human feedback, where we ask human subjects to select a small number of likely samples and we ensemble the averages of selected samples. This strategy leverages the high-quality image generation capabilities of DMs, while recognizing the importance of obtaining a single trustworthy solution, especially in use cases, such as identification of specific digits or letters, where generating multiple feasible solutions may not lead to a reliable outcome. Experimental results demonstrate that our proposed strategy provides more trustworthy solutions when compared to state-of-the art SR methods.

* total of 7 pages with double column, 1 and a half for references, 6 figures and 2 tables, submitted to IEEE ICIP 2024

Via

Access Paper or Ask Questions