Abstract:Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
Abstract:In recent years, the development of interconnected devices has expanded in many fields, from infotainment to education and industrial applications. This trend has been accelerated by the increased number of sensors and accessibility to powerful hardware and software. One area that significantly benefits from these advancements is Teleoperated Driving (TD). In this scenario, a controller drives safely a vehicle from remote leveraging sensors data generated onboard the vehicle, and exchanged via Vehicle-to-Everything (V2X) communications. In this work, we tackle the problem of detecting the presence of cars and pedestrians from point cloud data to enable safe TD operations. More specifically, we exploit the SELMA dataset, a multimodal, open-source, synthetic dataset for autonomous driving, that we expanded by including the ground-truth bounding boxes of 3D objects to support object detection. We analyze the performance of state-of-the-art compression algorithms and object detectors under several metrics, including compression efficiency, (de)compression and inference time, and detection accuracy. Moreover, we measure the impact of compression and detection on the V2X network in terms of data rate and latency with respect to 3GPP requirements for TD applications.
Abstract:During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's light-weight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.