Meng-Hsun Tsai

Semantic2Graph: Graph-based Multi-modal Feature for Action Segmentation in Videos

Sep 20, 2022

Junbin Zhang, Pei-Hsuan Tsai, Meng-Hsun Tsai

Abstract: Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, computationally expensive visual models to understand videos comprehensively. However, few studies directly employ graph models to reason about video. Graph models offer fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method, Semantic2Graph, that turns video action segmentation and recognition into a node classification problem on graphs. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges model long-term spatio-temporal relations, while the semantic features are embeddings of the label text obtained from a textual prompt. A Graph Neural Network (GNN) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph improves on state-of-the-art results on GTEA and 50Salads. Multiple ablation experiments further confirm that semantic features improve model performance and that semantic edges enable Semantic2Graph to capture long-term dependencies at low cost.

* 10 pages, 3 figures, 8 tables. This paper was submitted to IEEE Transactions on Multimedia
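
As a rough illustration of the frame-level graph described in the abstract, the sketch below builds an edge list with the three edge types it names: temporal, semantic, and self-loop. The abstract does not say how semantic edges are chosen, so the rule used here is an assumption made only for illustration: it links the last frame of an action segment to the first frame of the next segment with the same label. The function name `build_frame_graph` and the edge-type strings are likewise hypothetical and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Tuple

Edge = Tuple[int, int, str]  # (source frame, target frame, edge type)


def build_frame_graph(labels: List[str]) -> List[Edge]:
    """Build an edge list over frames: one node per frame, three edge types.

    `labels` holds one action label per frame. The semantic-edge rule below
    is an illustrative assumption, not the construction used in the paper.
    """
    edges: List[Edge] = []
    n = len(labels)

    for i in range(n):
        # Self-loop: each frame keeps its own features during aggregation.
        edges.append((i, i, "self_loop"))
        # Temporal edge: connect each frame to its immediate successor.
        if i + 1 < n:
            edges.append((i, i + 1, "temporal"))

    # Group consecutive frames with the same label into segments,
    # stored as (first frame, last frame, label).
    segments: List[Tuple[int, int, str]] = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((start, start + length - 1, label))
        start += length

    # Semantic edge (assumed rule): link the end of a segment to the start
    # of the next later segment that carries the same label.
    for k, (_, end, label) in enumerate(segments):
        for next_start, _, next_label in segments[k + 1:]:
            if next_label == label:
                edges.append((end, next_start, "semantic"))
                break

    return edges


if __name__ == "__main__":
    frame_labels = ["take", "take", "pour", "pour", "take"]
    for edge in build_frame_graph(frame_labels):
        print(edge)
```

Under this assumed rule, the five-frame sequence above gets a single semantic edge from frame 1 to frame 4, which connects the two "take" segments across the intervening "pour" segment. In a full pipeline, an edge list like this would be passed to a GNN together with the per-frame node attributes.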
