Abstract:This paper introduces a new and challenging Hidden Intention Discovery (HID) task. Unlike existing intention recognition tasks, which are based on obvious visual representations to identify common intentions for normal behavior, HID focuses on discovering hidden intentions when humans try to hide their intentions for abnormal behavior. HID presents a unique challenge in that hidden intentions lack the obvious visual representations to distinguish them from normal intentions. Fortunately, from a sociological and psychological perspective, we find that the difference between hidden and normal intentions can be reasoned from multiple micro-behaviors, such as gaze, attention, and facial expressions. Therefore, we first discover the relationship between micro-behavior and hidden intentions and use graph structure to reason about hidden intentions. To facilitate research in the field of HID, we also constructed a seminal dataset containing a hidden intention annotation of a typical theft scenario for HID. Extensive experiments show that the proposed network improves performance on the HID task by 9.9\% over the state-of-the-art method SBP.
Abstract:Previous group activity recognition approaches were limited to reasoning using human relations or finding important subgroups and tended to ignore indispensable group composition and human-object interactions. This absence makes a partial interpretation of the scene and increases the interference of irrelevant actions on the results. Therefore, we propose our DynamicFormer with Dynamic composition Module (DcM) and Dynamic interaction Module (DiM) to model relations and locations of persons and discriminate the contribution of participants, respectively. Our findings on group composition and human-object interaction inspire our core idea. Group composition tells us the location of people and their relations inside the group, while interaction reflects the relation between humans and objects outside the group. We utilize spatial and temporal encoders in DcM to model our dynamic composition and build DiM to explore interaction with a novel GCN, which has a transformer inside to consider the temporal neighbors of human/object. Also, a Multi-level Dynamic Integration is employed to integrate features from different levels. We conduct extensive experiments on two public datasets and show that our method achieves state-of-the-art.