Abstract:Natural language offers a convenient, flexible interface for controlling robotic C-arm X-ray systems, making advanced functionality and controls accessible. However, enabling language interfaces requires specialized AI models that interpret X-ray images to create a semantic representation for reasoning. The fixed outputs of such AI models limit the functionality of language controls. Incorporating flexible, language-aligned AI models prompted through language enables more versatile interfaces for diverse tasks and procedures. Using a language-aligned foundation model for X-ray image segmentation, our system continually updates a patient digital twin based on sparse reconstructions of desired anatomical structures. This supports autonomous capabilities such as visualization, patient-specific viewfinding, and automatic collimation from novel viewpoints, enabling commands such as 'Focus in on the lower lumbar vertebrae.' In a cadaver study, users visualized, localized, and collimated structures across the torso using verbal commands, achieving 84% end-to-end success. Post hoc analysis of randomly oriented images showed that our patient digital twin could localize 35 commonly requested structures to within 51.68 mm, enabling localization and isolation from arbitrary orientations. Our results demonstrate how intelligent robotic X-ray systems can directly incorporate physicians' expressed intent. While existing foundation models for intra-operative X-ray analysis still exhibit failure modes, as they improve they can facilitate highly flexible, intelligent robotic C-arms.
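To make the language-to-collimation loop described above concrete, the sketch below shows one plausible control flow: a spoken command is mapped to a target structure, a language-prompted segmentation model produces a mask, and the tightest bounding box around that mask drives the collimator. `model.parse_target`, `model.segment`, and `carm.set_collimator` are hypothetical stand-ins for the paper's language-aligned foundation model and robot interface, not its actual API.

```python
# Minimal sketch of a language-driven collimation loop (hypothetical interfaces).
import numpy as np

def collimation_box(mask: np.ndarray, margin: int = 20):
    """Tightest axis-aligned box (plus a safety margin) around a binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # requested structure not visible from this viewpoint
    return (max(xs.min() - margin, 0), max(ys.min() - margin, 0),
            min(xs.max() + margin, mask.shape[1] - 1),
            min(ys.max() + margin, mask.shape[0] - 1))

def handle_command(command: str, image: np.ndarray, model, carm):
    """E.g. command = 'Focus in on the lower lumbar vertebrae.'"""
    structure = model.parse_target(command)        # language -> anatomy name (assumed helper)
    mask = model.segment(image, prompt=structure)  # language-prompted segmentation (assumed helper)
    box = collimation_box(mask)
    if box is not None:
        carm.set_collimator(box)                   # isolate the requested structure (assumed helper)
```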
Abstract:Surgical phase recognition is essential for analyzing procedure-specific surgical videos. While recent transformer-based architectures have advanced sequence processing capabilities, they struggle with maintaining consistency across lengthy surgical procedures. Drawing inspiration from classical hidden Markov models' finite-state interpretations, we introduce the neural finite-state machine (NFSM) module, which bridges procedural understanding with deep learning approaches. NFSM combines procedure-level understanding with neural networks through global state embeddings, attention-based dynamic transition tables, and transition-aware training and inference mechanisms for offline and online applications. When integrated into our future-aware architecture, NFSM improves video-level accuracy, phase-level precision, recall, and Jaccard indices on the Cholec80 dataset by 2.3, 3.2, 3.0, and 4.8 percentage points, respectively. As an add-on module to existing state-of-the-art models like Surgformer, NFSM further enhances performance, demonstrating its complementary value. Extended experiments on non-surgical datasets validate NFSM's generalizability beyond surgical domains. Comprehensive experiments demonstrate that incorporating NFSM into deep learning frameworks enables more robust and consistent phase recognition across long procedural videos.
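The following PyTorch sketch illustrates the neural finite-state-machine idea in its simplest form: learned per-phase state embeddings, an attention-derived transition table conditioned on the current frame, and a transition-aware re-weighting of the frame-level prediction. Dimensions and the exact fusion rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFSM(nn.Module):
    """Toy NFSM: global state embeddings + frame-conditioned transition table."""

    def __init__(self, num_phases: int, feat_dim: int, emb_dim: int = 128):
        super().__init__()
        self.state_emb = nn.Embedding(num_phases, emb_dim)  # global state embeddings
        self.query = nn.Linear(feat_dim, emb_dim)            # frame feature -> query

    def transition_table(self, frame_feat: torch.Tensor) -> torch.Tensor:
        """Dynamic (P x P) transition matrix conditioned on the current frame."""
        q = self.query(frame_feat)                                  # (B, emb)
        e = self.state_emb.weight                                   # (P, emb)
        logits = (e + q.unsqueeze(1)) @ e.t() / e.shape[-1] ** 0.5  # (B, P, P)
        return F.softmax(logits, dim=-1)

    def forward(self, prev_phase_probs, frame_feat, frame_logits):
        T = self.transition_table(frame_feat)                           # (B, P, P)
        prior = torch.bmm(prev_phase_probs.unsqueeze(1), T).squeeze(1)  # propagate previous belief
        post = F.softmax(frame_logits, dim=-1) * prior                  # transition-aware posterior
        return post / post.sum(dim=-1, keepdim=True)
```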
Abstract:Robustness audits of deep neural networks (DNNs) provide a means to uncover model sensitivities to the challenging real-world imaging conditions that significantly degrade DNN performance in the wild. Such conditions are often the result of the compounding of multiple factors inherent to the environment, sensor, or processing pipeline and may lead to complex image distortions that are not easily categorized. When robustness audits are limited to a set of pre-determined imaging effects or distortions, the results cannot be (easily) transferred to real-world conditions where image corruptions may be more complex or nuanced. To address this challenge, we present an alternative robustness auditing method that uses causal inference to measure DNN sensitivities to the factors of the imaging process that cause complex distortions. Our approach uses causal models to explicitly encode assumptions about the domain-relevant factors and their interactions. Then, through extensive experiments on natural and rendered images across multiple vision tasks, we show that our approach reliably estimates the causal effects of each factor on DNN performance using observational domain data. These causal effects directly tie DNN sensitivities to observable properties of the imaging pipeline in the domain of interest, reducing the risk of unexpected DNN failures when the model is deployed in that domain.
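As a toy illustration of estimating a causal effect of an imaging factor on DNN performance from observational data, the sketch below applies backdoor adjustment over a single discrete confounder. The column names and the assumed causal graph (treatment: high sensor gain, confounder: scene type, outcome: per-image correctness) are illustrative assumptions, not the paper's method or data.

```python
import pandas as pd

def backdoor_ate(df: pd.DataFrame, treatment: str, outcome: str, confounder: str) -> float:
    """E[Y | do(T=1)] - E[Y | do(T=0)] via adjustment on one discrete confounder.

    Assumes both treatment values appear in every confounder stratum.
    """
    ate = 0.0
    for z, weight in df[confounder].value_counts(normalize=True).items():
        stratum = df[df[confounder] == z]
        y1 = stratum.loc[stratum[treatment] == 1, outcome].mean()
        y0 = stratum.loc[stratum[treatment] == 0, outcome].mean()
        ate += weight * (y1 - y0)
    return ate

# usage (hypothetical columns): backdoor_ate(results, "high_gain", "correct", "scene")
```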
Abstract:Purpose: Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set, resulting in poor generalizability. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer that separates high-level analysis (SPR) from low-level processing (geometric understanding). This approach takes advantage of recent vision foundation models that ensure reliable low-level scene understanding to craft DT-based scene representations supporting various high-level tasks. Methods: We present a DT-based framework for SPR from videos. The framework employs vision foundation models to extract representations, which we embed in place of raw video inputs in the state-of-the-art Surgformer model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Results: In contrast to the vulnerable baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 51.1% on the challenging CRCD dataset, 96.0% on an internal robotics training dataset, and 64.4% on a highly corrupted Cholec80 test set. Conclusion: Our findings support the thesis that DT-based scene representations are effective in enhancing model robustness. Future work will seek to improve feature informativeness, automate feature extraction, and incorporate interpretability for a more comprehensive framework.
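A minimal sketch of the digital-twin substitution described in the Methods: each raw frame is replaced by a geometric scene representation assembled from vision foundation model outputs before it reaches the phase-recognition transformer. `segment_tools`, `estimate_depth`, and `phase_model` are hypothetical placeholders for the foundation models and the Surgformer-style backbone; the exact representation used in the paper may differ.

```python
import torch

def dt_representation(frame: torch.Tensor, segment_tools, estimate_depth) -> torch.Tensor:
    """Build a per-frame DT-style scene representation instead of raw RGB."""
    masks = segment_tools(frame)          # (K, H, W) per-class/instance masks (assumed helper)
    depth = estimate_depth(frame)         # (1, H, W) relative depth (assumed helper)
    return torch.cat([masks, depth], 0)   # geometric scene representation

def predict_phase(frames, segment_tools, estimate_depth, phase_model):
    reps = torch.stack([dt_representation(f, segment_tools, estimate_depth) for f in frames])
    return phase_model(reps.unsqueeze(0))  # (1, T, K+1, H, W) fed in place of raw video
```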
Abstract:Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common-sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which result in failures to execute the planned actions, largely due to logical fallacies at the high or low level. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on the scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self-reflection. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy-level embodied tasks, outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search in enabling more robust automation in complex state and action spaces.
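The toy sketch below illustrates the predicate-grounding idea: before an LLM-proposed action is executed, its preconditions are checked against the perceived scene state, and unsatisfied predicates are returned as feedback the planner can use to recover. The predicate vocabulary and feedback format are illustrative assumptions, not ConceptAgent's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    preconditions: list = field(default_factory=list)  # e.g. ["holding(cup)", "near(table)"]

def ground_action(action: Action, scene_predicates: set):
    """Return (feasible, feedback) for an LLM-proposed action."""
    missing = [p for p in action.preconditions if p not in scene_predicates]
    if missing:
        # Infeasible: surface the violated predicates so the planner can re-plan.
        return False, f"Cannot {action.name}: unsatisfied predicates {missing}"
    return True, "ok"

# usage with a hypothetical scene state:
scene = {"near(table)", "empty_hand"}
ok, msg = ground_action(Action("place(cup, table)", ["holding(cup)", "near(table)"]), scene)
```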
Abstract:In percutaneous pelvic trauma surgery, accurate placement of Kirschner wires (K-wires) is crucial to ensure effective fracture fixation and avoid complications due to breaching the cortical bone along an unsuitable trajectory. Surgical navigation via mixed reality (MR) can help achieve precise wire placement in a low-profile form factor. Current approaches in this domain are as yet unsuitable for real-world deployment because they fall short of guaranteeing accurate visual feedback due to uncontrolled bending of the wire. To ensure accurate feedback, we introduce StraightTrack, an MR navigation system designed for percutaneous wire placement in complex anatomy. StraightTrack features a marker body equipped with a rigid access cannula that mitigates wire bending due to interactions with soft tissue and a covered bony surface. Integrated with an Optical See-Through Head-Mounted Display (OST HMD) capable of tracking the cannula body, StraightTrack offers real-time 3D visualization and guidance without external trackers, which are prone to losing line of sight. In phantom experiments with two experienced orthopedic surgeons, StraightTrack improves wire placement accuracy, achieving the ideal trajectory within $5.26 \pm 2.29$ mm and $2.88 \pm 1.49$ degrees, compared to over 12.08 mm and 4.07 degrees for comparable methods. As MR navigation systems continue to mature, StraightTrack realizes their potential for internal fracture fixation and other percutaneous orthopedic procedures.
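For readers unfamiliar with how millimetre/degree trajectory accuracies of this kind are typically quantified, the snippet below computes a translational offset between entry points and the angle between planned and executed wire axes. It is a generic sketch of such a metric, not the paper's evaluation code.

```python
import numpy as np

def trajectory_error(entry_planned, dir_planned, entry_actual, dir_actual):
    """Translational offset (mm) and angular deviation (degrees) between two wire trajectories."""
    entry_planned, entry_actual = np.asarray(entry_planned), np.asarray(entry_actual)
    d_p = np.asarray(dir_planned) / np.linalg.norm(dir_planned)
    d_a = np.asarray(dir_actual) / np.linalg.norm(dir_actual)
    translation_mm = np.linalg.norm(entry_actual - entry_planned)
    angle_deg = np.degrees(np.arccos(np.clip(abs(d_p @ d_a), -1.0, 1.0)))
    return translation_mm, angle_deg
```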
Abstract:Arthroscopy is a minimally invasive surgical procedure used to diagnose and treat joint problems. The clinical workflow of arthroscopy typically involves inserting an arthroscope into the joint through a small incision, during which surgeons navigate and operate largely by relying on their visual assessment through the arthroscope. However, the arthroscope's restricted field of view and lack of depth perception pose challenges in navigating complex articular structures and achieving surgical precision during procedures. Aiming to enhance intraoperative awareness, we present a robust pipeline that incorporates simultaneous localization and mapping, depth estimation, and 3D Gaussian splatting to realistically reconstruct intra-articular structures solely from monocular arthroscope video. Extending 3D reconstruction to Augmented Reality (AR) applications, our solution offers AR assistance for articular notch measurement and annotation anchoring in a human-in-the-loop manner. Compared to traditional Structure-from-Motion and Neural Radiance Field-based methods, our pipeline achieves dense 3D reconstruction and competitive rendering fidelity with explicit 3D representation in 7 minutes on average. When evaluated on four phantom datasets, our method achieves an average reconstruction error of RMSE = 2.21 mm, PSNR = 32.86, and SSIM = 0.89. Our AR measurement tool achieves accuracy within 1.59 +/- 1.81 mm, and the AR annotation tool achieves a mIoU of 0.721. Because our pipeline enables AR reconstruction and guidance directly from monocular arthroscopy without any additional data and/or hardware, our solution may hold potential for enhancing intraoperative awareness and facilitating surgical precision in arthroscopy.
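A minimal sketch of the AR measurement idea: once a metric scale factor for the monocular reconstruction is known (single-camera reconstructions are scale-ambiguous), a notch measurement reduces to the scaled Euclidean distance between two user-anchored 3D points in the reconstructed scene. The scale-recovery step itself is assumed given and is not shown.

```python
import numpy as np

def ar_measure(p1, p2, metric_scale: float = 1.0) -> float:
    """Distance in millimetres between two annotated 3D points of the reconstruction."""
    return float(metric_scale * np.linalg.norm(np.asarray(p1) - np.asarray(p2)))
```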
Abstract:Fully supervised deep learning (DL) models for surgical video segmentation have been shown to struggle with non-adversarial, real-world corruptions of image quality, including smoke, bleeding, and low illumination. Foundation models for image segmentation, such as the segment anything model (SAM) that focuses on interactive prompt-based segmentation, move away from semantic classes and thus can be trained on larger and more diverse data, which offers outstanding zero-shot generalization with appropriate user prompts. Recently, building upon this success, SAM-2 has been proposed to further extend zero-shot interactive segmentation from independent frame-by-frame processing to video segmentation. In this paper, we present a first experimental study evaluating SAM-2's performance on surgical video data. Leveraging the SegSTRONG-C MICCAI EndoVIS 2024 sub-challenge dataset, we assess SAM-2's effectiveness on uncorrupted endoscopic sequences and evaluate its non-adversarial robustness on videos with corrupted image quality simulating smoke, bleeding, and low-brightness conditions under various prompt strategies. Our experiments demonstrate that SAM-2, in a zero-shot manner, can achieve competitive or even superior performance compared to fully supervised deep learning models on surgical video data, including under non-adversarial corruptions of image quality. Additionally, SAM-2 consistently outperforms the original SAM and its medical variants across all conditions. Finally, frame-sparse prompting can consistently outperform frame-wise prompting for SAM-2, suggesting that allowing SAM-2 to leverage its temporal modeling capabilities leads to more coherent and accurate segmentation than frequent prompting.
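The sketch below shows frame-sparse prompting in the spirit described above: SAM-2 is prompted on only a few anchor frames and its memory propagates the mask in between, rather than prompting every frame. It follows the interface of the publicly released `sam2` video predictor, but exact config names and call signatures may vary across versions, and the checkpoint path, frame directory, and click coordinates are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint names; substitute the ones shipped with your sam2 version.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path="frames/")           # directory of extracted video frames

anchor_frames = [0, 60, 120]                                  # frame-sparse prompts only
for f in anchor_frames:
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=f, obj_id=1,
        points=np.array([[480, 270]], dtype=np.float32),      # one positive click (placeholder)
        labels=np.array([1], dtype=np.int32))

masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()   # threshold logits to a binary mask
```

Frame-wise prompting would instead call the prompting step on every frame, discarding the temporal memory that the sparse variant exploits.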
Abstract:Brain tumor analysis in Magnetic Resonance Imaging (MRI) is crucial for accurate diagnosis and treatment planning. However, the task remains challenging due to the complexity and variability of tumor appearances, as well as the scarcity of labeled data. Traditional approaches often address tumor segmentation and image generation separately, limiting their effectiveness in capturing the intricate relationships between healthy and pathological tissue structures. We introduce a novel promptable counterfactual diffusion model as a unified solution for brain tumor segmentation and generation in MRI. The key innovation lies in our mask-level prompting mechanism at the sampling stage, which enables guided generation and manipulation of specific healthy or unhealthy regions in MRI images. Specifically, the model's architecture allows for bidirectional inference: it can segment tumors in existing images and generate realistic tumor structures in healthy brain scans. Furthermore, we present a two-step approach for tumor generation and position transfer, showcasing the model's versatility in synthesizing realistic tumor structures. Experiments on the BRATS2021 dataset demonstrate that our method outperforms traditional counterfactual diffusion approaches, achieving a mean IoU of 0.653 and a mean Dice score of 0.785 for tumor segmentation, compared to 0.344 and 0.475 for a conventional counterfactual diffusion model. Our work contributes to improving brain tumor detection and segmentation accuracy, with potential implications for data augmentation and clinical decision support in neuro-oncology. The code is available at https://github.com/arcadelab/counterfactual_diffusion.
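As a schematic of mask-level prompting at sampling time, the sketch below regenerates only the prompted region: at every reverse-diffusion step, pixels outside the mask are replaced with a forward-noised copy of the input image, so the counterfactual edit is confined to the masked (healthy or unhealthy) area. `model` and `scheduler` are assumed to follow a diffusers-style interface (noise-prediction model, `step`, `add_noise`), which is an assumption rather than the repository's actual code.

```python
import torch

@torch.no_grad()
def masked_counterfactual_sampling(model, scheduler, image, mask):
    """Regenerate only the masked region of `image`; scheduler.set_timesteps(...) is assumed set."""
    x = torch.randn_like(image)                                   # start from pure noise
    for t in scheduler.timesteps:
        eps = model(x, t)                                         # predicted noise at step t
        x = scheduler.step(eps, t, x).prev_sample                 # one reverse-diffusion step
        noisy_known = scheduler.add_noise(image, torch.randn_like(image), t)
        x = mask * x + (1 - mask) * noisy_known                   # keep unmasked content from the input
    return x
```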
Abstract:Accurate segmentation of anatomical structures and pathological regions in medical images is crucial for diagnosis, treatment planning, and disease monitoring. While the Segment Anything Model (SAM) and its variants have demonstrated impressive interactive segmentation capabilities on image types not seen during training, without the need for domain adaptation or retraining, their practical application in volumetric 3D medical imaging workflows has been hindered by the lack of a user-friendly interface. To address this challenge, we introduce FastSAM-3DSlicer, a 3D Slicer extension that integrates both 2D and 3D SAM models, including SAM-Med2D, MedSAM, SAM-Med3D, and FastSAM-3D. Building on the well-established open-source 3D Slicer platform, our extension enables efficient, real-time segmentation of 3D volumetric medical images with seamless interaction and visualization. By automating the handling of raw image data, user prompts, and segmented masks, FastSAM-3DSlicer provides a streamlined, user-friendly interface that can be easily incorporated into medical image analysis workflows. Performance evaluations reveal that the FastSAM-3DSlicer extension running FastSAM-3D achieves low inference times of only 1.09 seconds per volume on CPU and 0.73 seconds per volume on GPU, making it well suited for real-time interactive segmentation. Moreover, we introduce an uncertainty quantification scheme that leverages the rapid inference capabilities of FastSAM-3D for practical implementation, further enhancing its reliability and applicability in medical settings. FastSAM-3DSlicer thus offers an interactive platform and user interface for 2D and 3D volumetric medical image segmentation, combining efficiency, precision, and ease of use with SAMs. The source code and a video demonstration are publicly available at https://github.com/arcadelab/FastSAM3D_slicer.
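The sketch below illustrates the kind of uncertainty quantification that sub-second inference makes practical: re-running segmentation with jittered point prompts and reporting the voxel-wise disagreement as a frequency map and entropy. `fastsam3d_segment` is a hypothetical wrapper around the extension's model, not its actual API, and the jitter scheme is an illustrative assumption.

```python
import numpy as np

def prompt_jitter_uncertainty(volume, point, fastsam3d_segment, n_runs=8, sigma=2.0, seed=0):
    """Consensus mask and per-voxel uncertainty from repeated, prompt-perturbed inference."""
    rng = np.random.default_rng(seed)
    masks = []
    for _ in range(n_runs):
        jittered = np.asarray(point, dtype=float) + rng.normal(0.0, sigma, size=3)  # perturb the prompt
        masks.append(fastsam3d_segment(volume, jittered).astype(np.float32))        # assumed helper
    freq = np.mean(masks, axis=0)                       # per-voxel foreground frequency across runs
    entropy = -(freq * np.log(freq + 1e-8) + (1 - freq) * np.log(1 - freq + 1e-8))
    return freq > 0.5, entropy                          # consensus mask + uncertainty map
```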