Alan C. Bovik

HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation

May 27, 2025

Bowen Chen, Cheng-han Lee, Yixu Chen, Zaixi Shang, Hai Wei, Alan C. Bovik

Abstract: We introduce HDRSDR-VQA, a large-scale video quality assessment dataset designed to facilitate comparative analysis between High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content under realistic viewing conditions. The dataset comprises 960 videos generated from 54 diverse source sequences, each presented in both HDR and SDR formats across nine distortion levels. To obtain reliable perceptual quality scores, we conducted a comprehensive subjective study involving 145 participants and six consumer-grade HDR-capable televisions. In total, over 22,000 pairwise comparisons were collected and scaled into Just-Objectionable-Difference (JOD) scores. Unlike prior datasets that focus on a single dynamic range format or use limited evaluation protocols, HDRSDR-VQA enables direct content-level comparison between the HDR and SDR versions of each sequence, supporting detailed investigations into when and why one format is preferred over the other. The open-source portion of the dataset is publicly available to support further research in video quality assessment, content-adaptive streaming, and perceptual model development.
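
To make the scaling step concrete, the sketch below converts a complete matrix of pairwise preference counts into JOD-like scores using Mosteller's solution to Thurstone Case V, with the common convention that 1 JOD corresponds to 75% of observers preferring one condition over the other. This is an illustrative assumption, not the dataset's actual pipeline: the function name `jod_scale` is hypothetical, every pair is assumed to have been compared, and dedicated JOD scaling tools typically use maximum-likelihood estimation that also handles incomplete designs.

```python
from statistics import NormalDist


def jod_scale(counts):
    """Scale pairwise-comparison counts into JOD-like quality scores.

    Hypothetical helper (Mosteller's solution to Thurstone Case V).
    counts[i][j] is the number of times condition i was preferred over
    condition j; every pair is assumed to have been compared.
    Returns one score per condition, anchored so condition 0 scores 0.
    """
    n = len(counts)
    phi_inv = NormalDist().inv_cdf
    jod_unit = phi_inv(0.75)  # assumed convention: 1 JOD == 75% preference
    raw = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # self-comparison contributes z = 0
            wins, losses = counts[i][j], counts[j][i]
            # Add 0.5 to each count so unanimous pairs give a finite z-score.
            p = (wins + 0.5) / (wins + losses + 1.0)
            total += phi_inv(p) / jod_unit
        raw.append(total / n)  # row mean of the z-matrix (diagonal counted as 0)
    return [s - raw[0] for s in raw]
```

For example, `jod_scale([[0, 30], [10, 0]])` places the second condition roughly one JOD below the first, because it lost 30 of its 40 comparisons (75%). The 0.5 added to each count is a simple smoothing choice that keeps unanimous comparisons from producing infinite z-scores.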

Video Quality Assessment: A Comprehensive Survey

Dec 04, 2024

Qi Zheng, Yibo Fan, Leilei Huang, Tianyu Zhu, Jiaming Liu, Zhijian Hao, Shuo Xing, Chia-Ju Chen, Xiongkuo Min, Alan C. Bovik (+1 more)

Abstract: Video quality assessment (VQA) is an important processing task that aims to predict the quality of videos in a manner highly consistent with human judgments of perceived quality. Traditional VQA models based on natural image and/or video statistics, which are inspired both by models of projected images of the real world and by dual models of the human visual system, deliver only limited prediction performance on real-world user-generated content (UGC), as exemplified in recent large-scale VQA databases containing large numbers of diverse videos crawled from the web. Fortunately, recent advances in deep neural networks and Large Multimodal Models (LMMs) have enabled significant progress on this problem, yielding better results than prior handcrafted models. Numerous deep learning-based VQA models have been developed, with progress in this direction driven by the creation of content-diverse, large-scale human-labeled databases that supply ground-truth psychometric video quality data. Here, we present a comprehensive survey of recent progress in the development of VQA algorithms and of the benchmarking studies and databases that make them possible. We also analyze open research directions in study design and VQA algorithm architectures.

MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers

Nov 26, 2024

Ruoxi Zhu, Zhengzhong Tu, Jiaming Liu, Alan C. Bovik, Yibo Fan

Abstract: Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches can handle only a specific type of degradation, which is often insufficient in real-world scenarios such as rainy-snowy or rainy-hazy weather. To address these situations, we propose a multi-weather Transformer, or MWFormer for short, a holistic vision Transformer that aims to remove multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing in response to multiple possible weather conditions. Moreover, MWFormer allows for a novel way of tuning, at application time, to either a single type of weather restoration or hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements over existing state-of-the-art methods without requiring much computational cost. We also demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: https://github.com/taco-group/MWFormer

* Accepted by IEEE Transactions on Image Processing.
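
Feature-wise linear modulation (FiLM) scales and shifts each feature channel using parameters predicted from a conditioning signal. The sketch below is a minimal, hypothetical illustration, assuming a single linear hyper-network that maps a weather embedding to per-channel `gamma` and `beta`; the class name `TinyHyperFiLM`, its shapes, and the random initialization are illustrative assumptions and do not reproduce MWFormer's architecture.

```python
import numpy as np


class TinyHyperFiLM:
    """Hypothetical hyper-network + FiLM block (illustration only).

    A linear map turns a weather embedding into per-channel scale (gamma)
    and shift (beta) parameters that modulate a feature map.
    """

    def __init__(self, embed_dim, channels, seed=0):
        rng = np.random.default_rng(seed)
        # Hyper-network weights (randomly initialized here; learned in practice).
        self.w_gamma = rng.normal(scale=0.1, size=(embed_dim, channels))
        self.w_beta = rng.normal(scale=0.1, size=(embed_dim, channels))

    def __call__(self, features, embedding):
        # features: (C, H, W) feature map; embedding: (embed_dim,) weather code.
        gamma = 1.0 + embedding @ self.w_gamma  # offset by 1 so a zero code is identity
        beta = embedding @ self.w_beta
        # Feature-wise linear modulation: per-channel scale and shift.
        return gamma[:, None, None] * features + beta[:, None, None]


film = TinyHyperFiLM(embed_dim=8, channels=4)
x = np.random.default_rng(1).normal(size=(4, 16, 16))
rainy_hazy = np.random.default_rng(2).normal(size=8)  # stand-in weather embedding
y = film(x, rainy_hazy)
```

With a zero embedding the modulation reduces to the identity (gamma = 1, beta = 0). Different embeddings change only the per-channel scale and shift, which is one simple way a shared backbone can behave differently for different predicted weather types.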

Satellite Streaming Video QoE Prediction: A Real-World Subjective Database and Network-Level Prediction Models

Oct 17, 2024

Bowen Chen, Zaixi Shang, Jae Won Chung, David Lerner, Werner Robitza, Rakesh Rao Ramachandra Rao, Alexander Raake, Alan C. Bovik

Abstract: Demand for streaming services, including satellite, continues to exhibit unprecedented growth. Internet Service Providers (ISPs) find themselves at the crossroads of technological advancements and rising customer expectations. To stay relevant and competitive, these ISPs must ensure that their networks deliver optimal video streaming quality, a key determinant of user satisfaction. Towards this end, it is important to have accurate Quality of Experience (QoE) prediction models in place. However, achieving robust performance with these models requires extensive datasets labeled with subjective opinion scores on videos impaired by diverse playback disruptions. To bridge this data gap, we introduce the LIVE-Viasat Real-World Satellite QoE Database. This database consists of 179 videos recorded from real-world streaming services affected by various authentic distortion patterns. We also conducted a comprehensive subjective study involving 54 participants, who contributed both continuous-time opinion scores and endpoint (retrospective) QoE scores. Our analysis sheds light on various determinants of subjective QoE, such as stall events, spatial resolution, bitrate, and certain network parameters. We demonstrate the usefulness of this unique new resource by evaluating the efficacy of prevalent QoE prediction models on it. We also created a new model that maps network parameters to predicted human perception scores, which ISPs can use to optimize the video streaming quality of their networks. Our proposed model, which we call SatQA, accurately predicts QoE using only network parameters, without any access to pixel data or video-specific metadata, as measured by Spearman's Rank Order Correlation Coefficient (SROCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Squared Error (RMSE), which indicate high accuracy and reliability.
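
The three evaluation criteria named above can be computed as in the NumPy sketch below. SROCC is the Pearson correlation of the ranks (tied values receive their average rank), while PLCC and RMSE are computed here directly on raw predictions. Quality and QoE studies commonly fit a monotonic (often logistic) mapping from predictions to subjective scores before computing PLCC and RMSE; that step is omitted in this sketch, and the helper names are hypothetical.

```python
import numpy as np


def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    x = np.asarray(values, dtype=float)
    ranks = np.empty(len(x))
    ranks[np.argsort(x, kind="mergesort")] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def srocc(pred, mos):
    # Spearman correlation = Pearson correlation of the ranks.
    return float(np.corrcoef(_ranks(pred), _ranks(mos))[0, 1])


def plcc(pred, mos):
    # Pearson linear correlation on raw values (no logistic fit applied).
    return float(np.corrcoef(np.asarray(pred, float), np.asarray(mos, float))[0, 1])


def rmse(pred, mos):
    pred, mos = np.asarray(pred, dtype=float), np.asarray(mos, dtype=float)
    return float(np.sqrt(np.mean((pred - mos) ** 2)))
```

Because SROCC depends only on ordering, a model whose predictions are monotonically but nonlinearly related to the subjective scores can reach an SROCC of 1 while its raw PLCC stays below 1, which is one reason the logistic fit is usually applied before PLCC and RMSE.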

Quality Prediction of AI Generated Images and Videos: Emerging Trends and Opportunities

Oct 11, 2024

Abhijay Ghildyal, Yuanhan Chen, Saman Zadtootaghaj, Nabajeet Barman, Alan C. Bovik

Abstract: The advent of AI has influenced many aspects of human life, from self-driving cars and intelligent chatbots to text-based image and video generation models capable of creating realistic images and videos from user prompts (text-to-image, image-to-image, and image-to-video). AI-based methods for image and video super-resolution, video frame interpolation, denoising, and compression have already gathered significant attention and interest in industry, and some solutions are already being implemented in real-world products and services. However, to achieve widespread integration and acceptance, AI-generated and AI-enhanced content must be visually accurate, adhere to its intended use, and maintain high visual quality to avoid degrading the end user's quality of experience (QoE). One way to monitor and control the visual "quality" of AI-generated and AI-enhanced content is to deploy Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models. However, most existing IQA and VQA models measure visual fidelity in terms of "reconstruction" quality against pristine reference content and were not designed to assess the quality of "generative" artifacts. To address this, newer metrics and models have recently been proposed, but their performance evaluation and overall efficacy have been limited by datasets that were too small or otherwise lacked representative content and/or distortion capacity, and by a lack of performance measures that can accurately report the success of an IQA/VQA model for "GenAI". This paper examines the current shortcomings and possibilities presented by AI-generated and AI-enhanced image and video content, with a particular focus on end-user perceived quality. Finally, we discuss open questions and make recommendations for future work on "GenAI" quality assessment problems, toward further progress in this interesting and relevant field of research.

* Because the abstract field cannot be longer than 1,920 characters, the abstract shown here is slightly shorter than the one in the PDF file.

Constructing Per-Shot Bitrate Ladders using Visual Information Fidelity

Aug 04, 2024

Krishna Srikar Durbha, Alan C. Bovik

Abstract: Adaptive video streaming allows for the construction of bitrate ladders that deliver perceptually optimized visual quality to viewers under bandwidth constraints. Two common approaches to adaptation are per-title encoding and per-shot encoding. The former encodes each program, movie, or other content in a manner that is perceptually- and bandwidth-optimized for that content but is otherwise fixed. The latter is a more granular approach that optimizes the encoding parameters for each scene or shot (however defined) of a video. Per-shot video encoding, as pioneered by Netflix, encodes on a per-shot basis using the Dynamic Optimizer (DO). Under the control of the VMAF perceptual video quality prediction engine, the DO delivers high-quality videos to millions of viewers at considerably lower bitrates than per-title or fixed bitrate ladder encoding. A variety of per-title and per-shot encoding techniques have recently been proposed that seek to reduce computational overhead and to construct optimal bitrate ladders more efficiently using low-level features extracted from source videos. Here we develop a perceptually optimized method of constructing optimal per-shot bitrate and quality ladders, using an ensemble of low-level features and Visual Information Fidelity (VIF) features extracted from different scales and subbands. Using Bjontegaard delta metrics, we compare the performance of our model, which we call VIF-ladder, against other content-adaptive bitrate ladder prediction methods, counterparts of them that we designed to construct quality ladders, a fixed bitrate ladder, and bitrate ladders constructed via exhaustive encoding.

* Under Review
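
Bjontegaard delta metrics summarize the gap between two rate-quality curves. The sketch below computes BD-rate in the classic way: fit a cubic polynomial of log-bitrate as a function of quality for each curve, integrate both fits over the overlapping quality range, and convert the average log-rate difference into a percentage. It is an illustrative assumption of how such a comparison might be run, not the paper's evaluation code; the quality axis could be PSNR or VMAF, the function name is hypothetical, and each curve needs at least four points for the cubic fit.

```python
import numpy as np


def bd_rate(rate_ref, q_ref, rate_test, q_test):
    """Bjontegaard delta rate (%) of a test curve relative to a reference.

    Negative values mean the test curve needs less bitrate for equal quality.
    Each curve needs at least four (bitrate, quality) points.
    """
    # Cubic fit of log-bitrate as a function of quality for each curve.
    p_ref = np.polyfit(q_ref, np.log(rate_ref), 3)
    p_test = np.polyfit(q_test, np.log(rate_test), 3)

    # Integrate only over the quality range covered by both curves.
    lo = max(min(q_ref), min(q_test))
    hi = min(max(q_ref), max(q_test))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")

    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_ref = (np.polyval(int_ref, hi) - np.polyval(int_ref, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)

    # Average log-rate gap -> percentage bitrate difference.
    return float((np.exp(avg_test - avg_ref) - 1.0) * 100.0)
```

A negative result means the test ladder needs less bitrate than the reference at equal quality; for instance, scaling every bitrate of a reference curve by 0.9 while keeping its quality values yields a BD-rate of about -10%.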

YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals

Jun 24, 2024

Sandeep Mishra, Oindrila Saha, Alan C. Bovik

Abstract: 3D generation guided by text-to-image diffusion models enables the creation of visually compelling assets. However, previous methods explore generation based on images or text, so the boundaries of creativity are limited by what can be expressed through words or by the images that can be sourced. We present YouDream, a method to generate high-quality, anatomically controllable animals. YouDream is guided by a text-to-image diffusion model controlled by 2D views of a 3D pose prior. Our method generates 3D animals that cannot be created using previous text-to-3D generative methods. Additionally, our method preserves anatomical consistency in the generated animals, an area where prior text-to-3D approaches often struggle. Moreover, we design a fully automated pipeline for generating commonly found animals. To circumvent the need for human intervention to create a 3D pose, we propose a multi-agent LLM that adapts poses from a limited library of animal 3D poses to represent the desired animal. A user study conducted on the outputs of YouDream shows that participants preferred the animal models generated by our method over those generated by other methods. Turntable results and code are released at https://youdream3d.github.io/

C3DAG: Controlled 3D Animal Generation using 3D pose guidance

Jun 11, 2024

Sandeep Mishra, Oindrila Saha, Alan C. Bovik

Abstract: Recent advances in text-to-3D generation have demonstrated the ability to generate high-quality 3D assets. However, when generating animals, these methods underperform, often portraying inaccurate anatomy and geometry. Towards ameliorating this defect, we present C3DAG, a novel pose-Controlled text-to-3D Animal Generation framework that generates a high-quality 3D animal consistent with a given pose. We also introduce an automatic 3D shape creator tool that allows dynamic pose generation and modification via a web-based interface and that generates a 3D balloon animal using simple geometries. A NeRF is then initialized from this 3D shape using depth-controlled SDS. In the next stage, the pre-trained NeRF is fine-tuned using quadruped-pose-controlled SDS. The pipeline that we have developed not only produces geometrically and anatomically consistent results but also renders highly controlled 3D animals, unlike prior methods, which do not allow fine-grained pose control.
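
For reference, the standard score distillation sampling (SDS) gradient, as introduced with DreamFusion, is given below. Here θ are the NeRF parameters, x = g(θ) is a rendered view, z_t is its noised version at diffusion timestep t, y is the text prompt, ε̂_φ is the frozen diffusion model's noise prediction, and w(t) is a timestep weighting. In a controlled variant, the noise predictor would additionally be conditioned on a control image c (for instance a depth map or a rendered pose), written ε̂_φ(z_t; y, c, t); this is an assumption about the general form only, and the exact conditioning used by C3DAG is not specified here.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi, \mathbf{x} = g(\theta)\bigr)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(\mathbf{z}_t; y, t) - \epsilon\bigr)\,
      \frac{\partial \mathbf{x}}{\partial \theta}
    \right],
\qquad
\mathbf{z}_t = \alpha_t \mathbf{x} + \sigma_t \epsilon,
\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
```

The diffusion model itself is never updated: its residual noise prediction acts as a gradient signal that pushes the rendered views, and hence the NeRF, toward images the model considers likely under the prompt (and, in the controlled case, the control signal).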

Cut-FUNQUE: An Objective Quality Model for Compressed Tone-Mapped High Dynamic Range Videos

Apr 20, 2024

Abhinau K. Venkataramanan, Cosmin Stejerean, Ioannis Katsavounidis, Hassene Tmar, Alan C. Bovik

Abstract: High Dynamic Range (HDR) videos have enjoyed a surge in popularity in recent years due to their ability to represent a wider range of contrast and color than Standard Dynamic Range (SDR) videos. Although HDR video capture has become increasingly popular thanks to recent flagship mobile phones such as Apple iPhones, Google Pixels, and Samsung Galaxy phones, a broad swath of consumers still use legacy SDR displays that are unable to display HDR videos. As a result, HDR videos must be processed, i.e., tone-mapped, before being streamed to a large segment of SDR-capable video consumers. However, server-side tone mapping involves automating decisions regarding the choice of tone-mapping operators (TMOs) and their parameters to yield high-fidelity outputs. Moreover, these choices must be balanced against the effects of lossy compression, which is ubiquitous in streaming scenarios. In this work, we develop a novel, efficient objective video quality model named Cut-FUNQUE that accurately predicts the visual quality of tone-mapped and compressed HDR videos. Finally, we evaluate Cut-FUNQUE on a large-scale crowdsourced database of such videos and show that it achieves state-of-the-art accuracy.
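
As a concrete example of a tone-mapping operator, the sketch below implements the global form of Reinhard et al.'s photographic operator, followed by a simple power-law encoding to 8-bit SDR code values. It is an assumption-laden illustration only: it operates on linear luminance rather than on PQ- or HLG-encoded HDR signals, ignores color handling and compression, and is not claimed to match the TMOs or parameter choices studied in this work. The `key` parameter (0.18 by default) controls overall brightness and is the kind of TMO parameter a server-side pipeline would need to choose.

```python
import numpy as np


def reinhard_global(luminance, key=0.18, eps=1e-6):
    """Global photographic operator (Reinhard et al.), applied to linear luminance."""
    lum = np.asarray(luminance, dtype=float)
    # Log-average (geometric mean) luminance estimates the scene's overall brightness.
    log_avg = np.exp(np.mean(np.log(eps + lum)))
    scaled = key * lum / log_avg
    # Compress [0, inf) into [0, 1): bright values saturate smoothly toward 1.
    return scaled / (1.0 + scaled)


def to_sdr_8bit(display, gamma=2.2):
    # Assumed simple power-law encoding; real SDR pipelines follow BT.709/BT.1886.
    return np.round(255.0 * np.clip(display, 0.0, 1.0) ** (1.0 / gamma)).astype(np.uint8)


hdr_nits = np.array([0.05, 1.0, 100.0, 1000.0, 4000.0])  # example linear luminances
sdr_codes = to_sdr_8bit(reinhard_global(hdr_nits))
```

Because the operator is monotonic, the ordering of luminances is preserved while highlights are compressed far more strongly than shadows; subsequent lossy compression of these 8-bit values is what a model like Cut-FUNQUE must then account for.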

Joint Quality Assessment and Example-Guided Image Processing by Disentangling Picture Appearance from Content

Apr 20, 2024

Abhinau K. Venkataramanan, Cosmin Stejerean, Ioannis Katsavounidis, Hassene Tmar, Alan C. Bovik

Abstract: The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessment. Despite often being treated separately, these tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner, and we use the learned features to develop a new quality prediction model named DisQUE. Through extensive evaluations, we demonstrate that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features can also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
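
A toy version of a content/appearance split can be written with per-channel feature statistics, as in adaptive instance normalization (AdaIN): the per-channel mean and standard deviation stand in for "appearance", and the normalized feature map stands in for "content". This is a swapped-in, hand-crafted technique used only for illustration; DisQUE learns its decomposition in a self-supervised manner, and the function names below are hypothetical.

```python
import numpy as np


def split_content_appearance(feat, eps=1e-5):
    """Toy split of a (C, H, W) feature map (AdaIN-style, not DisQUE's learned one)."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    content = (feat - mean) / std  # per-channel normalized structure
    return content, (mean, std)    # appearance = per-channel statistics


def recombine(content, appearance):
    mean, std = appearance
    return content * std + mean


def transfer_appearance(src, example):
    # Keep the content of `src`, borrow the appearance of `example`.
    content, _ = split_content_appearance(src)
    _, appearance = split_content_appearance(example)
    return recombine(content, appearance)
```

`transfer_appearance(src, example)` keeps the normalized structure of `src` while adopting the channel statistics of `example`, loosely mirroring example-guided processing in which the desired output characteristics come from a reference rather than from hand-tuned parameters.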
