Abstract: Video frame interpolation is an increasingly important research task with several key industrial applications in the video coding, broadcast and production sectors. Recently, transformers have been introduced to the field, resulting in substantial performance gains. However, this comes at the cost of greatly increased memory usage and longer training and inference times. In this paper, a novel method integrating a transformer encoder and convolutional features is proposed. This network reduces the memory burden by close to 50% and runs up to four times faster during inference compared to existing transformer-based interpolation methods. A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with that of the transformer in modelling long-range dependencies. Quantitative evaluations are conducted on various benchmarks with complex motion to showcase the robustness of the proposed method, which achieves performance competitive with state-of-the-art interpolation networks.
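The following is a minimal sketch of the dual-encoder idea described above: one convolutional branch for local correlations and one transformer-encoder branch for long-range dependencies, with their features fused before the interpolation head. All layer sizes, the token embedding and the concat-plus-1x1 fusion are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Sketch of a dual-encoder block fusing convolutional and transformer features.
import torch
import torch.nn as nn


class DualEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=64, heads=4, depth=2):
        super().__init__()
        # Convolutional branch: local feature extraction
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer branch: long-range dependencies over spatial tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(in_ch, dim, 1)     # token embedding
        self.fuse = nn.Conv2d(2 * dim, dim, 1)   # simple concat + 1x1 fusion

    def forward(self, x):
        conv_feat = self.conv_branch(x)                    # (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        trans_feat = self.transformer(tokens)
        trans_feat = trans_feat.transpose(1, 2).reshape_as(conv_feat)
        return self.fuse(torch.cat([conv_feat, trans_feat], dim=1))


frames = torch.randn(1, 3, 64, 64)   # a stacked frame pair would use in_ch=6
features = DualEncoder()(frames)
print(features.shape)                # torch.Size([1, 64, 64, 64])
```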
Abstract: Existing datasets for manually labelled query-based video summarization are costly to produce and thus small, limiting the performance of supervised deep video summarization models. Self-supervision can address this data sparsity challenge by defining a pretext task and a method for acquiring extra data with pseudo labels to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels derived from the input videos to properly model both the relationship between the pretext task and the target task, and the implicit relationship between the pseudo labels and the human-defined labels. The pseudo labels are generated from existing human-defined frame-level labels. To create more accurate query-dependent video summaries, a semantics booster is proposed to generate context-aware query representations. Furthermore, we propose mutual attention to capture the interaction between the visual and textual modalities. Three commonly used video summarization benchmarks are used to thoroughly validate the proposed approach. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
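As a rough illustration of the mutual attention mentioned above, the sketch below applies cross-attention in both directions between pre-extracted frame features and query-token features. The embedding dimension, the use of multi-head attention and the residual combination are assumptions; the paper's semantics booster and exact attention design are not given in the abstract.

```python
# Sketch of mutual (cross-modal) attention between visual and textual features.
import torch
import torch.nn as nn


class MutualAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, textual):
        # visual: (B, num_frames, dim), textual: (B, num_tokens, dim)
        # Visual features attend to the query tokens ...
        v_att, _ = self.vis_to_txt(query=visual, key=textual, value=textual)
        # ... and the query tokens attend back to the visual features.
        t_att, _ = self.txt_to_vis(query=textual, key=visual, value=visual)
        return visual + v_att, textual + t_att


vis = torch.randn(2, 120, 256)   # 120 frame-level features
txt = torch.randn(2, 8, 256)     # 8 query-token features
vis_out, txt_out = MutualAttention()(vis, txt)
print(vis_out.shape, txt_out.shape)
```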
Abstract: In this paper we propose a generative adversarial network (GAN) framework to enhance the perceptual quality of compressed videos. Our framework includes attention and adaptation to different quantization parameters (QPs) in a single model. The attention module exploits global receptive fields that can capture and align long-range correlations between consecutive frames, which is beneficial for enhancing the perceptual quality of videos. The frame to be enhanced is fed into the deep network together with its neighboring frames, and in the first stage features at different depths are extracted. The extracted features are then fed into attention blocks to explore global temporal correlations, followed by a series of upsampling and convolution layers. Finally, the resulting features are processed by a QP-conditional adaptation module which leverages the corresponding QP information. In this way, a single model can adaptively enhance videos compressed with various QPs, without requiring a separate model for every QP value, while achieving similar performance. Experimental results demonstrate the superior performance of the proposed PeQuENet compared with state-of-the-art compressed video quality enhancement algorithms.
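A minimal sketch of one possible QP-conditional adaptation is shown below: the QP value is normalised, embedded by a small MLP, and used to predict per-channel scale and shift parameters that modulate the enhancement features. The layer sizes, the normalisation by 51 (the H.265/H.266 QP range) and the affine modulation form are assumptions; the abstract does not describe the exact module.

```python
# Sketch of a QP-conditional feature adaptation module.
import torch
import torch.nn as nn


class QPAdaptation(nn.Module):
    def __init__(self, channels=64, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )

    def forward(self, feat, qp):
        # feat: (B, C, H, W), qp: (B,) integer QP values
        qp = qp.float().unsqueeze(1) / 51.0          # normalise to [0, 1]
        scale, shift = self.mlp(qp).chunk(2, dim=1)  # (B, C) each
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feat * (1.0 + scale) + shift


feat = torch.randn(2, 64, 32, 32)
qp = torch.tensor([27, 37])
print(QPAdaptation()(feat, qp).shape)   # torch.Size([2, 64, 32, 32])
```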
Abstract: Neural video compression has emerged as a novel paradigm combining trainable multilayer neural networks and machine learning, achieving competitive rate-distortion (RD) performance, but it remains impractical due to heavy neural architectures with large memory and computational demands. In addition, models are usually optimized for a single RD tradeoff. Recent slimmable image codecs can dynamically adjust their model capacity to gracefully reduce memory and computation requirements without harming RD performance. In this paper we propose a slimmable video codec (SlimVC) by integrating a slimmable temporal entropy model into a slimmable autoencoder. Despite a significantly more complex architecture, we show that slimming remains a powerful mechanism to control rate, memory footprint, computational cost and latency, all of which are important requirements for practical video compression.
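To make the slimming mechanism concrete, here is a minimal sketch of a slimmable convolution: a single set of weights evaluated at several widths, so memory and computation scale with the chosen width. The width multipliers and toy layer are assumptions; the actual SlimVC autoencoder and slimmable temporal entropy model are far more involved.

```python
# Sketch of a slimmable convolution layer with selectable width.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableConv2d(nn.Conv2d):
    def __init__(self, max_in, max_out, k, width_mults=(0.25, 0.5, 1.0)):
        super().__init__(max_in, max_out, k, padding=k // 2)
        self.width_mults = width_mults
        self.active_width = 1.0

    def forward(self, x):
        in_ch = x.shape[1]
        out_ch = int(self.out_channels * self.active_width)
        weight = self.weight[:out_ch, :in_ch]      # slice out a sub-network
        bias = self.bias[:out_ch] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding)


conv = SlimmableConv2d(3, 64, 3)
x = torch.randn(1, 3, 64, 64)
for w in conv.width_mults:
    conv.active_width = w            # select a capacity/quality operating point
    print(w, conv(x).shape)          # 16, 32 or 64 output channels
```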
Abstract: Video frame interpolation involves the synthesis of new frames from existing ones. Convolutional neural networks (CNNs) have been at the forefront of recent advances in this field. One popular CNN-based approach applies generated kernels to the input frames to obtain an interpolated frame. Despite the benefits such interpolation methods offer, many of these networks require a large number of parameters, and more parameters mean a heavier computational burden, while reducing the size of the model typically degrades performance. This paper presents a parameter-reduction method for a popular flow-less kernel-based network (Adaptive Collaboration of Flows). By removing the layers that require the most parameters and replacing them with smaller encoders, we reduce the number of parameters of the network and even achieve better performance than the original method. This is achieved by applying rotation to force each individual encoder to learn different features from the input images. Ablations are conducted to justify design choices, and an evaluation of how our method performs on full-length videos is presented.
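The sketch below illustrates the rotation idea in isolation: each small encoder receives the input frames rotated by a different multiple of 90 degrees, encouraging the encoders to learn complementary features, and the outputs are rotated back before being combined. The encoder definitions, the number of encoders and the concatenation are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of rotation-diversified small encoders for frame interpolation features.
import torch
import torch.nn as nn


def small_encoder(in_ch=6, dim=32):
    return nn.Sequential(
        nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
    )


class RotatedEncoders(nn.Module):
    def __init__(self, num_encoders=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [small_encoder() for _ in range(num_encoders)])

    def forward(self, frame_pair):
        feats = []
        for k, enc in enumerate(self.encoders):
            rotated = torch.rot90(frame_pair, k, dims=(2, 3))  # rotate input
            feat = enc(rotated)
            feats.append(torch.rot90(feat, -k, dims=(2, 3)))   # rotate back
        return torch.cat(feats, dim=1)


pair = torch.cat([torch.randn(1, 3, 64, 64)] * 2, dim=1)  # two stacked frames
print(RotatedEncoders()(pair).shape)                      # (1, 128, 64, 64)
```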
Abstract: In video coding, in-loop filters are applied to reconstructed video frames to enhance their perceptual quality before the frames are stored for output. Conventional in-loop filters are hand-crafted. Recently, learned filters based on convolutional neural networks that utilize attention mechanisms have been shown to improve upon traditional techniques. However, these solutions are typically significantly more computationally expensive, limiting their potential for practical applications. The proposed method uses a novel combination of sparsity and structured pruning to reduce the complexity of learned in-loop filters. This is done through a three-step training process of magnitude-guided weight pruning, insignificant neuron identification and removal, and fine-tuning. Initial tests show that the network parameters can be significantly reduced with minimal impact on network performance.
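The three-step process can be sketched with PyTorch's built-in magnitude pruning as a stand-in, as below: (1) magnitude-guided weight pruning, (2) identification of insignificant neurons (output channels whose remaining weights are near zero), (3) fine-tuning. The pruning amount, the channel-significance threshold and the toy filter are assumptions, not the paper's settings.

```python
# Sketch of sparsity + structured pruning for a learned in-loop filter.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

filt = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 1, 3, padding=1))

# Step 1: magnitude-guided weight pruning (30% smallest weights per layer)
for m in filt.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")  # make the sparsity permanent

# Step 2: identify insignificant neurons (output channels with tiny L1 norm)
conv = filt[0]
channel_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
keep = channel_norms > 0.01 * channel_norms.max()
print(f"keeping {int(keep.sum())}/{conv.out_channels} channels")
# (In practice the kept channels are copied into a smaller Conv2d and the
#  next layer's input channels are reduced accordingly.)

# Step 3: fine-tune the pruned filter on the original enhancement task
opt = torch.optim.Adam(filt.parameters(), lr=1e-4)
x, target = torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32)
loss = nn.functional.mse_loss(filt(x), target)
loss.backward()
opt.step()
```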
Abstract: In this paper, we propose a deformable convolution-based generative adversarial network (DCNGAN) for perceptual quality enhancement of compressed videos. DCNGAN is also adaptive to the quantization parameters (QPs). Compared with optical flow, deformable convolutions are more effective and efficient at aligning frames. Rather than aligning frames in a pairwise manner, deformable convolutions can process multiple frames simultaneously, leveraging more temporal information at lower computational complexity, which is beneficial for enhancing the perceptual quality of compressed videos. Experimental results demonstrate that the proposed DCNGAN outperforms other state-of-the-art compressed video quality enhancement algorithms.
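Below is a minimal sketch of multi-frame alignment with deformable convolutions, using torchvision's DeformConv2d: offsets are predicted jointly from the stacked neighbouring frames, so all frames are handled in a single pass rather than pairwise. The channel sizes and the simple offset predictor are assumptions; the abstract does not describe DCNGAN's exact alignment module.

```python
# Sketch of multi-frame alignment with a deformable convolution.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformAlign(nn.Module):
    def __init__(self, num_frames=3, dim=32, k=3):
        super().__init__()
        in_ch = 3 * num_frames
        # 2 offsets (x, y) per kernel position
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(in_ch, dim, k, padding=k // 2)

    def forward(self, frames):
        # frames: (B, num_frames, 3, H, W) -> stacked along channels
        x = frames.flatten(1, 2)
        offsets = self.offset_pred(x)
        return self.dcn(x, offsets)


frames = torch.randn(1, 3, 3, 64, 64)   # target frame plus two neighbours
aligned = DeformAlign()(frames)
print(aligned.shape)                    # (1, 32, 64, 64)
```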
Abstract: Recent years have witnessed significant development of learning-based video compression methods, which aim at optimizing objective or perceptual quality and bit rates. In this paper, we introduce deep video compression with perceptual optimizations (DVC-P), which aims at increasing the perceptual quality of decoded videos. Our proposed DVC-P is based on the Deep Video Compression (DVC) network but improves it with perceptual optimizations. Specifically, a discriminator network and a mixed loss are employed to help our network trade off among distortion, perception and rate. Furthermore, nearest-neighbor interpolation is used to eliminate the checkerboard artifacts that can appear in sequences encoded with DVC frameworks. Thanks to these two improvements, the perceptual quality of decoded sequences is improved. Experimental results demonstrate that, compared with the baseline DVC, our proposed method can generate videos with higher perceptual quality, achieving a 12.27% reduction in a perceptual BD-rate equivalent on average.
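The checkerboard-artifact fix is a well-known substitution, sketched below: a transposed convolution is replaced by nearest-neighbor upsampling followed by a regular convolution. The layer sizes are illustrative only; in DVC-P this substitution is applied inside the DVC decoder networks.

```python
# Sketch: replacing transposed convolution with nearest-neighbor upsampling.
import torch
import torch.nn as nn

# Transposed convolution (can produce checkerboard artifacts)
deconv_up = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)

# Nearest-neighbor interpolation + convolution (avoids checkerboard artifacts)
nn_up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

x = torch.randn(1, 64, 16, 16)
print(deconv_up(x).shape, nn_up(x).shape)   # both (1, 32, 32, 32)
```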
Abstract: The versatility of recent machine learning approaches makes them ideal for improving next-generation video compression solutions. Unfortunately, these approaches typically bring significant increases in computational complexity and are difficult to translate into explainable models, limiting their potential for implementation within practical video coding applications. This paper introduces a novel explainable neural network-based inter-prediction scheme to improve the interpolation of reference samples needed for fractional-precision motion compensation. The approach requires a single neural network to be trained, from which a full quarter-pixel interpolation filter set is derived, as the network is easily interpretable due to its linear structure. A novel training framework enables each network branch to resemble a specific fractional shift. This practical solution makes it very efficient to use alongside conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, 0.77%, 1.27% and 2.25% BD-rate savings can be achieved on average for lower-resolution sequences under the random access, low-delay B and low-delay P configurations, respectively, while the complexity of the learned interpolation schemes is significantly reduced compared with interpolation using full CNNs.
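The following sketch shows why a linear network is interpretable as a filter: with no nonlinearities, the mapping from the reference samples to each fractional-position prediction is itself a linear filter, so feeding unit impulses through a branch recovers its coefficients. The branch structure and tap count are illustrative assumptions, not the actual VVC filter dimensions or the paper's network.

```python
# Sketch: extracting an equivalent interpolation filter from a linear branch.
import torch
import torch.nn as nn

taps = 8  # number of integer reference samples used per prediction

# One linear branch per fractional shift, no activation functions anywhere
branch = nn.Sequential(nn.Linear(taps, 16, bias=False),
                       nn.Linear(16, 1, bias=False))

# Pass unit impulses through the branch to extract its equivalent filter
impulses = torch.eye(taps)
with torch.no_grad():
    coeffs = branch(impulses).squeeze(1)   # one coefficient per tap

print(coeffs)                       # the derived interpolation filter
samples = torch.randn(taps)
print(coeffs @ samples)             # equals branch(samples), since the map is linear
```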
Abstract: Machine learning techniques for more efficient video compression and video enhancement have been developed thanks to breakthroughs in deep learning. These new techniques, considered an advanced form of Artificial Intelligence (AI), bring previously unforeseen capabilities. However, they typically come in the form of resource-hungry black boxes (overly complex, with little transparency regarding their inner workings). Their application can therefore be unpredictable and generally unreliable for large-scale use (e.g. in live broadcast). The aim of this work is to understand and optimise learned models in video processing applications so that systems incorporating them can be used in a more trustworthy manner. In this context, the presented work introduces principles for the simplification of learned models, targeting improved transparency in implementing machine learning for video production and distribution applications. These principles are demonstrated on video compression examples, showing how bitrate savings and reduced complexity can be achieved by simplifying the relevant deep learning models.