Abstract:Purpose: Automated segmentation of anatomical structures in medical image analysis is a key step in defining topology to enable or assist autonomous intervention robots. Recent methods based on deep convolutional neural networks (CNN) have outperformed former heuristic methods. However, those methods were primarily evaluated on rigid, real-world environments. In this study, we evaluate existing segmentation methods for their use with soft tissue. Methods: The four CNN-based methods SegNet, UNet, ENet and ErfNet are trained with high supervision on a novel 7-class dataset of surgeries on the human larynx. The dataset contains 400 manually segmented images from two patients during laser incisions. The Intersection-over-Union (IoU) evaluation metric is used to measure the accuracy of each method. Data augmentation and network ensembling is employed to increase segmentation accuracy. Stochastic inference is used to show the uncertainty of the individual models. Results: Our study shows that an average ensemble network of UNet and ErfNet is best suited for laryngeal soft tissue segmentation with a mean IoU of 84.7 %. The highest efficiency is achieved by ENet with a mean inference time of 9.22 ms per image on an NVIDIA GeForce GTX 1080 Ti GPU. All methods can be improved by data augmentation. Conclusion: CNN-based methods for semantic segmentation are applicable to laryngeal soft tissue. The segmentation can be used for active constraints or autonomous control in robot-assisted laser surgery. Further improvements could be achieved by using a larger dataset or training the models in a self-supervised manner on additional unlabeled data.