Abstract: Human-object interaction (HOI) detection, which captures relationships between humans and objects, is an important task in the semantic understanding of images. When a graph convolutional network (GCN) processes human and object keypoints extracted from an image to detect HOI, it is crucial to extract appropriate object keypoints regardless of the object type and to design a GCN that accurately captures the spatial relationships between keypoints. This paper presents the human and object keypoint-based extension module (HOKEM), an easy-to-use extension module that improves the accuracy of conventional detection models. The proposed object keypoint extraction method is simple yet accurately represents the shapes of various objects. Moreover, the proposed human-object adaptive GCN (HO-AGCN), which introduces adaptive graph optimization and an attention mechanism, accurately captures the spatial relationships between keypoints. Experiments on the HOI dataset V-COCO showed that HOKEM boosted the accuracy of an appearance-based model by a large margin.
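To make the idea of a graph convolution with an adaptive adjacency over human and object keypoints more concrete, the following is a minimal sketch and not the HO-AGCN described above. It assumes a fixed adjacency matrix (for example, skeleton links), a hypothetical learned offset matrix that adds or reweights edges such as human-to-object links, and simple row normalization. The attention mechanism is omitted, and all names (`adaptive_graph_conv`, `fixed_adj`, `learned_adj`) are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def adaptive_graph_conv(features, fixed_adj, learned_adj, weight):
    """One graph-convolution step with an adaptive adjacency (assumed form).

    features:    N x C keypoint features (human and object keypoints together)
    fixed_adj:   N x N predefined adjacency, e.g. skeleton links plus self-loops
    learned_adj: N x N offset that would be trained; here it is just an input
    weight:      C x C_out feature transform
    """
    # Combine the fixed topology with the learned offset.
    adj = [[f + l for f, l in zip(frow, lrow)]
           for frow, lrow in zip(fixed_adj, learned_adj)]

    # Row-normalize so each keypoint averages over its neighbours.
    # Rows that sum to zero are left as all zeros.
    norm = []
    for row in adj:
        total = sum(row)
        norm.append([v / total if total else 0.0 for v in row])

    # Aggregate neighbour features, then apply the feature transform.
    return matmul(matmul(norm, features), weight)


# Two human keypoints (indices 0, 1) and one object keypoint (index 2).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fixed_adj = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
# Hypothetical learned edge linking the object keypoint to human keypoint 0.
learned_adj = [[0, 0, 0], [0, 0, 0], [1, 0, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

out = adaptive_graph_conv(features, fixed_adj, learned_adj, identity)
```

In this toy setup the learned offset lets the object keypoint aggregate features from a human keypoint even though the fixed graph contains no such edge, which is the kind of flexibility an adaptive adjacency is meant to provide.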
Abstract: Skeleton-based two-person interaction recognition has been gaining increasing attention as advancements are made in pose estimation and graph convolutional networks. Although accuracy has been gradually improving, the accompanying growth in computational complexity makes these methods increasingly impractical for real-world environments. There is also still room to improve accuracy, as conventional methods do not fully represent the relationships between inter-body joints. In this paper, we propose a lightweight model for accurately recognizing two-person interactions. In addition to an architecture that incorporates middle fusion, we introduce a factorized convolution technique to reduce the number of weight parameters in the model. We also introduce a network stream that accounts for changes in the relative distances between inter-body joints to improve accuracy. Experiments on two large-scale datasets, NTU RGB+D 60 and NTU RGB+D 120, show that our method simultaneously achieves the highest accuracy and relatively low computational complexity compared with conventional methods.
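As an illustration of what an input based on relative distance changes between inter-body joints could look like, the following minimal sketch computes, for each frame, the Euclidean distance between every joint of one person and every joint of the other, and then takes frame-to-frame differences. This is an assumed formulation for illustration only; the abstract does not specify how the stream's input is built, and the function names are hypothetical.

```python
import math


def inter_body_distances(person_a, person_b):
    """Distances between every joint of person A and every joint of person B.

    Each person is a list of joint coordinates, e.g. (x, y, z) tuples.
    Returns a matrix with len(person_a) rows and len(person_b) columns.
    """
    return [[math.dist(ja, jb) for jb in person_b] for ja in person_a]


def relative_distance_changes(seq_a, seq_b):
    """Frame-to-frame change of the inter-body distance matrix.

    seq_a, seq_b: lists of frames, each frame a list of joint coordinates.
    Returns one difference matrix per pair of consecutive frames.
    """
    per_frame = [inter_body_distances(a, b) for a, b in zip(seq_a, seq_b)]
    changes = []
    for prev, curr in zip(per_frame, per_frame[1:]):
        changes.append([[c - p for p, c in zip(prow, crow)]
                        for prow, crow in zip(prev, curr)])
    return changes


# Toy example: two frames, two joints per person.
seq_a = [[(0, 0, 0), (0, 1, 0)],
         [(0, 0, 0), (0, 1, 0)]]
seq_b = [[(3, 4, 0), (3, 1, 0)],   # person B starts far away
         [(0, 0, 1), (0, 1, 1)]]   # and moves close to person A

changes = relative_distance_changes(seq_a, seq_b)
```

Negative entries indicate joint pairs that moved closer together between frames, which is the kind of cue that can help separate approaching interactions from ones where the two people move apart.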