Seyed Hamid Reza Roodabeh

Multimodal Transformers for Real-Time Surgical Activity Prediction

Mar 11, 2024

Keshara Weerasinghe, Seyed Hamid Reza Roodabeh, Kay Hutchinson, Homa Alemzadeh

Abstract: Real-time recognition and prediction of surgical activities are fundamental to advancing safety and autonomy in robot-assisted surgery. This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories based on short segments of kinematic and video data. We conduct an ablation study to evaluate the impact of fusing different input modalities and their representations on gesture recognition and prediction performance. We perform an end-to-end assessment of the proposed architecture using the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with 89.5% accuracy for gesture prediction through effective fusion of kinematic features with spatial and contextual video features. It achieves real-time performance, taking 1.1-1.3 ms to process a 1-second input window, by relying on a computationally efficient model.
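
The abstract does not specify how the kinematic and video streams are combined or windowed, so the following is only a minimal sketch of one common approach: per-frame concatenation of the two feature vectors, followed by slicing the fused sequence into short fixed-length windows. The function names `fuse_frames` and `sliding_windows`, the toy feature sizes, and the toy rate of 3 frames per second are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fuse_frames(kinematics: Sequence[Vector], video: Sequence[Vector]) -> List[List[float]]:
    """Concatenate time-aligned kinematic and video feature vectors frame by frame.

    Assumption: both streams are already synchronized and have one vector per frame.
    """
    if len(kinematics) != len(video):
        raise ValueError("streams must have the same number of frames")
    return [list(k) + list(v) for k, v in zip(kinematics, video)]


def sliding_windows(frames: Sequence[List[float]], window: int, stride: int = 1) -> List[List[List[float]]]:
    """Slice a fused sequence into overlapping windows of `window` frames."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    # range() is empty when the sequence is shorter than one window.
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]


# Toy data: 7 frames, 2 kinematic features and 3 video features per frame.
toy_kinematics = [[float(t), float(t) + 0.5] for t in range(7)]
toy_video = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(7)]

fused = fuse_frames(toy_kinematics, toy_video)

# With a hypothetical rate of 3 frames per second, a 1-second window is 3 frames.
windows = sliding_windows(fused, window=3, stride=1)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window produced this way would be one input to a sequence model such as a transformer encoder. Concatenation is used here only because it is the simplest fusion scheme to illustrate; the paper's ablation study compares different modalities and representations, which this sketch does not attempt to reproduce.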

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
