Abstract:Purpose: Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set, resulting in poor generalizability. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer to separate high-level analysis (SPR) from low-level processing (geometric understanding). This approach takes advantage of the recent vision foundation models that ensure reliable low-level scene understanding to craft DT-based scene representations that support various high-level tasks. Methods: We present a DT-based framework for SPR from videos. The framework employs vision foundation models to extract representations. We embed the representation in place of raw video inputs in the state-of-the-art Surgformer model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Results: Contrary to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 51.1 on the challenging CRCD dataset, 96.0 on an internal robotics training dataset, and 64.4 on a highly corrupted Cholec80 test set. Conclusion: Our findings lend support to the thesis that DT-based scene representations are effective in enhancing model robustness. Future work will seek to improve the feature informativeness, automate feature extraction, and incorporate interpretability for a more comprehensive framework.
Abstract:We explore whether surgical manipulation tasks can be learned on the da Vinci robot via imitation learning. However, the da Vinci system presents unique challenges which hinder straight-forward implementation of imitation learning. Notably, its forward kinematics is inconsistent due to imprecise joint measurements, and naively training a policy using such approximate kinematics data often leads to task failure. To overcome this limitation, we introduce a relative action formulation which enables successful policy training and deployment using its approximate kinematics data. A promising outcome of this approach is that the large repository of clinical data, which contains approximate kinematics, may be directly utilized for robot learning without further corrections. We demonstrate our findings through successful execution of three fundamental surgical tasks, including tissue manipulation, needle handling, and knot-tying.
Abstract:The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real-time for surgical applications, toward which we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.
Abstract:There is increasing interest in the application large language models (LLMs) to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical beliefs, and cognitive bias. Taking a step toward understanding this, our hypothesis posits that when LLMs are confronted with clinical questions containing cognitive biases, they will yield significantly less accurate responses compared to the same questions presented without such biases. In this study, we developed BiasMedQA, a benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA we evaluated six LLMs, namely GPT-4, Mixtral-8x70B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically-relevant cognitive biases. Our analysis revealed varying effects for biases on these LLMs, with GPT-4 standing out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing towards safer and more reliable applications in healthcare.
Abstract:The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific objectives that solve a single robotic problem such as picking up an object or reaching a target position. However, recent work on high-capacity models in robotics has shown promise toward being trained on large collections of diverse and task-agnostic datasets of video demonstrations. These models have shown impressive levels of generalization to unseen circumstances, especially as the amount of data and the model complexity scale. Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning for a few reasons: (1) there is a lack of existing large-scale open-source data to train models, (2) it is challenging to model the soft-body deformations that these robots work with during surgery because simulation cannot match the physical and visual complexity of biological tissue, and (3) surgical robots risk harming patients when tested in clinical trials and require more extensive safety measures. This perspective article aims to provide a path toward increasing robot autonomy in robot-assisted surgery through the development of a multi-modal, multi-task, vision-language-action model for surgical robots. Ultimately, we argue that surgical robots are uniquely positioned to benefit from general-purpose models and provide three guiding actions toward increased autonomy in robot-assisted surgery.
Abstract:A surgeon's physiological hand tremor can significantly impact the outcome of delicate and precise retinal surgery, such as retinal vein cannulation (RVC) and epiretinal membrane peeling. Robot-assisted eye surgery technology provides ophthalmologists with advanced capabilities such as hand tremor cancellation, hand motion scaling, and safety constraints that enable them to perform these otherwise challenging and high-risk surgeries with high precision and safety. Steady-Hand Eye Robot (SHER) with cooperative control mode can filter out surgeon's hand tremor, yet another important safety feature, that is, minimizing the contact force between the surgical instrument and sclera surface for avoiding tissue damage cannot be met in this control mode. Also, other capabilities, such as hand motion scaling and haptic feedback, require a teleoperation control framework. In this work, for the first time, we implemented a teleoperation control mode incorporated with an adaptive sclera force control algorithm using a PHANTOM Omni haptic device and a force-sensing surgical instrument equipped with Fiber Bragg Grating (FBG) sensors attached to the SHER 2.1 end-effector. This adaptive sclera force control algorithm allows the robot to dynamically minimize the tool-sclera contact force. Moreover, for the first time, we compared the performance of the proposed adaptive teleoperation mode with the cooperative mode by conducting a vessel-following experiment inside an eye phantom under a microscope.
Abstract:We consider a micromanipulation problem in eye surgery, specifically retinal vein cannulation (RVC). RVC involves inserting a microneedle into a retinal vein for the purpose of targeted drug delivery. The procedure requires accurately guiding a needle to a target vein and inserting it while avoiding damage to the surrounding tissues. RVC can be considered similar to the reach or push task studied in robotics manipulation, but with additional constraints related to precision and safety while interacting with soft tissues. Prior works have mainly focused developing robotic hardware and sensors to enhance the surgeons' accuracy, leaving the automation of RVC largely unexplored. In this paper, we present the first autonomous strategy for RVC while relying on a minimal setup: a robotic arm, a needle, and monocular images. Our system exclusively relies on monocular vision to achieve precise navigation, gentle placement on the target vein, and safe insertion without causing tissue damage. Throughout the procedure, we employ machine learning for perception and to identify key surgical events such as needle-vein contact and vein punctures. Detecting these events guides our task and motion planning framework, which generates safe trajectories using model predictive control to complete the procedure. We validate our system through 24 successful autonomous trials on 4 cadaveric pig eyes. We show that our system can navigate to target veins within 22 micrometers of XY accuracy and under 35 seconds, and consistently puncture the target vein without causing tissue damage. Preliminary comparison to a human demonstrates the superior accuracy and reliability of our system.
Abstract:We propose a general strategy for autonomous guidance and insertion of a needle into a retinal blood vessel. The main challenges underpinning this task are the accurate placement of the needle-tip on the target vein and a careful needle insertion maneuver to avoid double-puncturing the vein, while dealing with challenging kinematic constraints and depth-estimation uncertainty. Following how surgeons perform this task purely based on visual feedback, we develop a system which relies solely on \emph{monocular} visual cues by combining data-driven kinematic and contact estimation, visual-servoing, and model-based optimal control. By relying on both known kinematic models, as well as deep-learning based perception modules, the system can localize the surgical needle tip and detect needle-tissue interactions and venipuncture events. The outputs from these perception modules are then combined with a motion planning framework that uses visual-servoing and optimal control to cannulate the target vein, while respecting kinematic constraints that consider the safety of the procedure. We demonstrate that we can reliably and consistently perform needle insertion in the domain of retinal surgery, specifically in performing retinal vein cannulation. Using cadaveric pig eyes, we demonstrate that our system can navigate to target veins within 22$\mu m$ XY accuracy and perform the entire procedure in less than 35 seconds on average, and all 24 trials performed on 4 pig eyes were successful. Preliminary comparison study against a human operator show that our system is consistently more accurate and safer, especially during safety-critical needle-tissue interactions. To the best of the authors' knowledge, this work accomplishes a first demonstration of autonomous retinal vein cannulation at a clinically-relevant setting using animal tissues.
Abstract:Recent technological advancements in retinal surgery has led to the modern operating room consisting of a surgical robot, microscope, and intraoperative optical coherence tomography (iOCT). The integration of these tools raises the fundamental question of how to effectively combine them to enable surgical autonomy. In this work, we address this question by developing a unified framework that enables real-time autonomous surgical workflows utilizing the aforementioned devices. To achieve this, we make the following contributions: (1) we develop a novel imaging system that integrates microscopy and iOCT in real-time, accomplished by dynamically tracking the surgical instrument via a small iOCT scanning region (e.g. B-scan), which was not previously possible; (2) implementing various convolutional neural networks (CNN) that automatically segment and detect task-relevant information for surgical autonomy; (3) enabling surgeons to intuitively select goal waypoints within both the microscope and iOCT views through simple mouse-click interactions; (4) integrating model predictive control (MPC) for real-time trajectory generation that respects kinematic constraints to ensure patient safety. We show the utility of our system by tackling subretinal injection (SI), a challenging procedure that involves inserting a microneedle below the retinal tissue for targeted drug delivery, a task surgeons find challenging due to requiring tens-of-micrometers of accuracy and precise depth perception. We validate our system by conducting 30 successful SI trials on pig eyes, achieving needle insertion accuracy of $26 \pm 12 \mu m$ to various subretinal goals and duration of $55 \pm 10.8$ seconds. Preliminary comparisons to a human operator performing SI in robot-assisted mode highlight the enhanced safety of our system.
Abstract:Important challenges in retinal microsurgery include prolonged operating time, inadequate force feedback, and poor depth perception due to a constrained top-down view of the surgery. The introduction of robot-assisted technology could potentially deal with such challenges and improve the surgeon's performance. Motivated by such challenges, this work develops a strategy for autonomous needle navigation in retinal microsurgery aiming to achieve precise manipulation, reduced end-to-end surgery time, and enhanced safety. This is accomplished through real-time geometry estimation and chance-constrained Model Predictive Control (MPC) resulting in high positional accuracy while keeping scleral forces within a safe level. The robotic system is validated using both open-sky and intact (with lens and partial vitreous removal) ex vivo porcine eyes. The experimental results demonstrate that the generation of safe control trajectories is robust to small motions associated with head drift. The mean navigation time and scleral force for MPC navigation experiments are 7.208 s and 11.97 mN, which can be considered efficient and well within acceptable safe limits. The resulting mean errors along lateral directions of the retina are below 0.06 mm, which is below the typical hand tremor amplitude in retinal microsurgery.