Sarah Fleischer

Egocentric RGB+Depth Action Recognition in Industry-Like Settings

Sep 25, 2023

Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah

Abstract: Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality, which can further amplify the subtleties of actions from an egocentric perspective, remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the information in both RGB and Depth modalities, we opt for late fusion to combine the predictions from each modality. We thoroughly evaluate our method on the action recognition task of the MECCANO dataset, and it significantly outperforms the prior work. Notably, our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
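
The late fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-modality predictions are combined, so the weighted average of softmax probabilities below is an assumption, and the names `softmax`, `late_fusion`, `predict_action`, and `rgb_weight` are hypothetical, chosen only for this illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, depth_logits, rgb_weight=0.5):
    """Fuse per-modality predictions after each model has run.

    Assumption: a convex combination of the two probability vectors.
    The source only states that predictions are combined late.
    """
    rgb_probs = softmax(rgb_logits)
    depth_probs = softmax(depth_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * d
        for r, d in zip(rgb_probs, depth_probs)
    ]


def predict_action(rgb_logits, depth_logits, rgb_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(rgb_logits, depth_logits, rgb_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, a confident RGB model can outvote an uncertain Depth model; setting `rgb_weight=0.0` reduces the sketch to the Depth prediction alone.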

Ensemble Modeling for Multimodal Visual Action Recognition

Aug 10, 2023

Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah

Abstract: In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we propose an exponentially decaying variant of focal loss for our current task. It initially emphasizes learning from the hard misclassified examples and gradually adapts to the entire range of examples in the dataset. This annealing process encourages the model to strike a balance between focusing on the sparse set of hard samples, while still leveraging the information provided by the easier ones. Additionally, we opt for the late fusion strategy to combine the resultant probability distributions from RGB and Depth modalities for final action prediction. Experimental evaluations on the MECCANO dataset demonstrate the effectiveness of our approach.
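
The annealing idea described in the abstract can be sketched as a focal loss whose focusing parameter shrinks over training. The standard focal loss for the true-class probability p is -(1 - p)^gamma * log(p). The abstract does not give the decay schedule, so the form gamma(t) = gamma_init * exp(-decay_rate * t) and the names `decayed_gamma`, `focal_loss`, `gamma_init`, and `decay_rate` are assumptions made for this illustration, not the authors' exact formulation.

```python
import math


def decayed_gamma(epoch, gamma_init=2.0, decay_rate=0.1):
    """Focusing parameter that decays exponentially with the epoch.

    Assumption: gamma_init and decay_rate are illustrative values only.
    """
    return gamma_init * math.exp(-decay_rate * epoch)


def focal_loss(p_true, gamma):
    """Focal loss for one sample, given the probability of its true class.

    The modulating factor (1 - p)^gamma down-weights easy samples
    (p close to 1). With gamma = 0 this is plain cross-entropy.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p) ** gamma) * math.log(p)


def easy_to_hard_ratio(epoch, p_easy=0.9, p_hard=0.2):
    """How much an easy sample contributes relative to a hard one."""
    gamma = decayed_gamma(epoch)
    return focal_loss(p_easy, gamma) / focal_loss(p_hard, gamma)
```

Under these assumptions, `easy_to_hard_ratio` grows as the epoch increases: early training is dominated by hard samples, and later training weights easy and hard samples closer to what cross-entropy would.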

* Technical Report accepted at the Multimodal Action Recognition Challenge on the MECCANO Dataset - ICIAP 2023
