Title: Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Jan 15, 2024

Morteza Moradi, Simone Palazzo, Concetto Spampinato

Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been an active research topic in video saliency prediction (VSP). Spatio-temporal transformers have largely compensated for the weakness of earlier strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies. Although VSP has benefited from spatio-temporal transformers, finding the most effective way to aggregate temporal features remains challenging. To address this, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy compensates for the lack of complex hierarchical interactions between the features extracted by the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and instead gradually reduces the temporal dimension of the features in the decoder. This decoder-based architecture achieves performance comparable to that of multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.

* 8 pages, 2 figures, 3 tables
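
The idea of gradually reducing the temporal dimension in the decoder can be illustrated with a toy sketch. This is not the authors' THTD-Net: the abstract does not specify the layers, so the pairwise averaging used here, the function names `halve_temporal` and `decode_gradually`, and the toy input are all assumptions made purely for illustration. A real decoder would use learned layers rather than fixed averaging.

```python
def halve_temporal(frames):
    """Merge adjacent frame pairs by element-wise averaging: T frames -> T // 2.

    Averaging is a stand-in (assumption) for whatever learned temporal
    reduction an actual decoder would apply.
    """
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    return [
        [(a + b) / 2.0 for a, b in zip(frames[i], frames[i + 1])]
        for i in range(0, len(frames), 2)
    ]


def decode_gradually(frames):
    """Repeatedly halve the temporal dimension until one frame remains.

    Assumes the number of input frames is a power of two.
    Returns the final frame and the temporal size after each stage.
    """
    temporal_sizes = [len(frames)]
    while len(frames) > 1:
        frames = halve_temporal(frames)
        temporal_sizes.append(len(frames))
    return frames[0], temporal_sizes


# Toy input: 8 frames, each a flattened 2x2 feature map (hypothetical values).
features = [[float(t)] * 4 for t in range(8)]
saliency, sizes = decode_gradually(features)
print(sizes)     # [8, 4, 2, 1]
print(saliency)  # [3.5, 3.5, 3.5, 3.5]
```

The point of the sketch is only the shape of the computation: a single decoding path that shrinks the temporal axis stage by stage (8, 4, 2, 1) instead of collapsing it in one step or spreading the work across several decoder branches.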
