Abstract:Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes including mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance the research on audio-visual saliency prediction for ODVs, we further establish a new benchmark based on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only models and audio-visual models. In addition, given the limitations of current models, we propose an innovative omnidirectional audio-visual saliency prediction network (OmniAVS), which is built based on the U-Net architecture, and hierarchically fuses audio and visual features from the multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS predcition tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.
Abstract:In this work, we present a novel forward differential deep learning-based algorithm for solving high-dimensional nonlinear backward stochastic differential equations (BSDEs). Motivated by the fact that differential deep learning can efficiently approximate the labels and their derivatives with respect to inputs, we transform the BSDE problem into a differential deep learning problem. This is done by leveraging Malliavin calculus, resulting in a system of BSDEs. The unknown solution of the BSDE system is a triple of processes $(Y, Z, \Gamma)$, representing the solution, its gradient, and the Hessian matrix. The main idea of our algorithm is to discretize the integrals using the Euler-Maruyama method and approximate the unknown discrete solution triple using three deep neural networks. The parameters of these networks are then optimized by globally minimizing a differential learning loss function, which is novelty defined as a weighted sum of the dynamics of the discretized system of BSDEs. Through various high-dimensional examples, we demonstrate that our proposed scheme is more efficient in terms of accuracy and computation time compared to other contemporary forward deep learning-based methodologies.
Abstract:Cardiac motion tracking from echocardiography can be used to estimate and quantify myocardial motion within a cardiac cycle. It is a cost-efficient and effective approach for assessing myocardial function. However, ultrasound imaging has the inherent characteristics of spatially low resolution and temporally random noise, which leads to difficulties in obtaining reliable annotation. Thus it is difficult to perform supervised learning for motion tracking. In addition, there is no end-to-end unsupervised method currently in the literature. This paper presents a motion tracking method where unsupervised optical flow networks are designed with spatial reconstruction loss and temporal-consistency loss. Our proposed loss functions make use of the pair-wise and temporal correlation to estimate cardiac motion from noisy background. Experiments using a synthetic 4D echocardiography dataset has shown the effectiveness of our approach, and its superiority over existing methods on both accuracy and running speed. To the best of our knowledge, this is the first work performed that uses unsupervised end-to-end deep learning optical flow network for 4D cardiac motion tracking.
Abstract:In this work, we propose a novel backward differential deep learning-based algorithm for solving high-dimensional nonlinear backward stochastic differential equations (BSDEs), where the deep neural network (DNN) models are trained not only on the inputs and labels but also the differentials of the corresponding labels. This is motivated by the fact that differential deep learning can provide an efficient approximation of the labels and their derivatives with respect to inputs. The BSDEs are reformulated as differential deep learning problems by using Malliavin calculus. The Malliavin derivatives of solution to a BSDE satisfy themselves another BSDE, resulting thus in a system of BSDEs. Such formulation requires the estimation of the solution, its gradient, and the Hessian matrix, represented by the triple of processes $\left(Y, Z, \Gamma\right).$ All the integrals within this system are discretized by using the Euler-Maruyama method. Subsequently, DNNs are employed to approximate the triple of these unknown processes. The DNN parameters are backwardly optimized at each time step by minimizing a differential learning type loss function, which is defined as a weighted sum of the dynamics of the discretized BSDE system, with the first term providing the dynamics of the process $Y$ and the other the process $Z$. An error analysis is carried out to show the convergence of the proposed algorithm. Various numerical experiments up to $50$ dimensions are provided to demonstrate the high efficiency. Both theoretically and numerically, it is demonstrated that our proposed scheme is more efficient compared to other contemporary deep learning-based methodologies, especially in the computation of the process $\Gamma$.
Abstract:In recent years, the rapid advancement of Artificial Intelligence Generated Content (AIGC) has attracted widespread attention. Among the AIGC, AI generated omnidirectional images hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications, hence omnidirectional AIGC techniques have also been widely studied. AI-generated omnidirectional images exhibit unique distortions compared to natural omnidirectional images, however, there is no dedicated Image Quality Assessment (IQA) criteria for assessing them. This study addresses this gap by establishing a large-scale AI generated omnidirectional image IQA database named AIGCOIQA2024 and constructing a comprehensive benchmark. We first generate 300 omnidirectional images based on 5 AIGC models utilizing 25 text prompts. A subjective IQA experiment is conducted subsequently to assess human visual preferences from three perspectives including quality, comfortability, and correspondence. Finally, we conduct a benchmark experiment to evaluate the performance of state-of-the-art IQA models on our database. The database will be released to facilitate future research.
Abstract:Deep learning-based numerical schemes for solving high-dimensional backward stochastic differential equations (BSDEs) have recently raised plenty of scientific interest. While they enable numerical methods to approximate very high-dimensional BSDEs, their reliability has not been studied and is thus not understood. In this work, we study uncertainty quantification (UQ) for a class of deep learning-based BSDE schemes. More precisely, we review the sources of uncertainty involved in the schemes and numerically study the impact of different sources. Usually, the standard deviation (STD) of the approximate solutions obtained from multiple runs of the algorithm with different datasets is calculated to address the uncertainty. This approach is computationally quite expensive, especially for high-dimensional problems. Hence, we develop a UQ model that efficiently estimates the STD of the approximate solution using only a single run of the algorithm. The model also estimates the mean of the approximate solution, which can be leveraged to initialize the algorithm and improve the optimization process. Our numerical experiments show that the UQ model produces reliable estimates of the mean and STD of the approximate solution for the considered class of deep learning-based BSDE schemes. The estimated STD captures multiple sources of uncertainty, demonstrating its effectiveness in quantifying the uncertainty. Additionally, the model illustrates the improved performance when comparing different schemes based on the estimated STD values. Furthermore, it can identify hyperparameter values for which the scheme achieves good approximations.
Abstract:Transformers have shown significant effectiveness for various vision tasks including both high-level vision and low-level vision. Recently, masked autoencoders (MAE) for feature pre-training have further unleashed the potential of Transformers, leading to state-of-the-art performances on various high-level vision tasks. However, the significance of MAE pre-training on low-level vision tasks has not been sufficiently explored. In this paper, we show that masked autoencoders are also scalable self-supervised learners for image processing tasks. We first present an efficient Transformer model considering both channel attention and shifted-window-based self-attention termed CSformer. Then we develop an effective MAE architecture for image processing (MAEIP) tasks. Extensive experimental results show that with the help of MAEIP pre-training, our proposed CSformer achieves state-of-the-art performance on various image processing tasks, including Gaussian denoising, real image denoising, single-image motion deblurring, defocus deblurring, and image deraining.
Abstract:Massive multiple-input and multiple-output (MIMO) enables ultra-high throughput and low latency for tile-based adaptive virtual reality (VR) 360 video transmission in wireless network. In this paper, we consider a massive MIMO system where multiple users in a single-cell theater watch an identical VR 360 video. Based on tile prediction, base station (BS) deliveries the tiles in predicted field of view (FoV) to users. By introducing practical supplementary transmission for missing tiles and unacceptable VR sickness, we propose the first stable transmission scheme for VR video. we formulate an integer non-linear programming (INLP) problem to maximize users' average quality of experience (QoE) score. Moreover, we derive the achievable spectral efficiency (SE) expression of predictive tile groups and the approximately achievable SE expression of missing tile groups, respectively. Analytically, the overall throughput is related to the number of tile groups and the length of pilot sequences. By exploiting the relationship between the structure of viewport tiles and SE expression, we propose a multi-lattice multi-stream grouping method aimed at improving the overall throughput for VR video transmission. Moreover, we analyze the relationship between QoE objective and number of predictive tile. We transform the original INLP problem into an integer linear programming problem by setting the predictive tiles groups as some constants. With variable relaxation and recovery, we obtain the optimal average QoE. Extensive simulation results validate that the proposed algorithm effectively improves QoE.
Abstract:We study deep learning-based schemes for solving high dimensional nonlinear backward stochastic differential equations (BSDEs). First we show how to improve the performances of the proposed scheme in [W. E and J. Han and A. Jentzen, Commun. Math. Stat., 5 (2017), pp.349-380] regarding computational time and stability of numerical convergence by using the advanced neural network architecture instead of the stacked deep neural networks. Furthermore, the proposed scheme in that work can be stuck in local minima, especially for a complex solution structure and longer terminal time. To solve this problem, we investigate to reformulate the problem by including local losses and exploit the Long Short Term Memory (LSTM) networks which are a type of recurrent neural networks (RNN). Finally, in order to study numerical convergence and thus illustrate the improved performances with the proposed methods, we provide numerical results for several 100-dimensional nonlinear BSDEs including a nonlinear pricing problem in finance.