Abstract:In the research of video quality assessment (VQA), two-branch network has emerged as a promising solution. It decouples VQA with separate technical and aesthetic branches to measure the perception of low-level distortions and high-level semantics respectively. However, we argue that while technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in semantic-aware manner. We hypothesize that existing technical branch struggles to perceive the semantics of high-resolution videos, as it is trained on local mini-patches sampled from videos. This issue can be hidden by apparently good results on low-resolution videos, but indeed becomes critical for high-resolution VQA. This work introduces SiamVQA, a simple but effective Siamese network for highre-solution VQA. SiamVQA shares weights between technical and aesthetic branches, enhancing the semantic perception ability of technical branch to facilitate technical-quality representation learning. Furthermore, it integrates a dual cross-attention layer for fusing technical and aesthetic features. SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks, and competitive results on lower-resolution benchmarks. Codes will be available at: https://github.com/srcn-ivl/SiamVQA
Abstract:Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. % Existing methods learn representations based on trajectory labels for individuals, which are annotation-expensive. % To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided. Instead, two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field view (outflow). % We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining. % To facilitate future study in this direction, we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. % Extensive results show that our baseline weakly supervised method outperforms supervised methods, and thus, little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model will be public at \href{https://github.com/streamer-AP/CGNet}{CGNet}
Abstract:Crowd counting is usually handled in a density map regression fashion, which is supervised via a L2 loss between the predicted density map and ground truth. To effectively regulate models, various improved L2 loss functions have been proposed to find a better correspondence between predicted density and annotation positions. In this paper, we propose to predict the density map at one resolution but measure the density map at multiple resolutions. By maximizing the posterior probability in such a setting, we obtain a log-formed multi-resolution L2-difference loss, where the traditional single-resolution L2 loss is its particular case. We mathematically prove it is superior to a single-resolution L2 loss. Without bells and whistles, the proposed loss substantially improves several baselines and performs favorably compared to state-of-the-art methods on four crowd counting datasets, ShanghaiTech A & B, UCF-QNRF, and JHU-Crowd++.