Abstract:We present an end-to-end deep Convolutional Neural Network called Convolutional Relational Machine (CRM) for recognizing group activities that utilizes the information in spatial relations between individual persons in image or video. It learns to produce an intermediate spatial representation (activity map) based on individual and group activities. A multi-stage refinement component is responsible for decreasing the incorrect predictions in the activity map. Finally, an aggregation component uses the refined information to recognize group activities. Experimental results demonstrate the constructive contribution of the information extracted and represented in the form of the activity map. CRM shows advantages over state-of-the-art models on Volleyball and Collective Activity datasets.
Abstract:In this work, we present a framework based on multi-stream convolutional neural networks (CNNs) for group activity recognition. Streams of CNNs are separately trained on different modalities and their predictions are fused at the end. Each stream has two branches to predict the group activity based on person and scene level representations. A new modality based on the human pose estimation is presented to add extra information to the model. We evaluate our method on the Volleyball and Collective Activity datasets. Experimental results show that the proposed framework is able to achieve state-of-the-art results when multiple or single frames are given as input to the model with 90.50% and 86.61% accuracy on Volleyball dataset, respectively, and 87.01% accuracy of multiple frames group activity on Collective Activity dataset.
Abstract:The overwhelming popularity of social media has resulted in bulk amounts of personal photos being uploaded to the internet every day. Since these photos are taken in unconstrained settings, recognizing the identities of people among the photos remains a challenge. Studies have indicated that utilizing evidence other than face appearance improves the performance of person recognition systems. In this work, we aim to take advantage of additional cues obtained from different body regions in a zooming in fashion for person recognition. Hence, we present Zoom-RNN, a novel method based on recurrent neural networks for combining evidence extracted from the whole body, upper body, and head regions. Our model is evaluated on a challenging dataset, namely People In Photo Albums (PIPA), and we demonstrate that employing our system improves the performance of conventional fusion methods by a noticeable margin.