Abstract: Automated segmentation of medical images depends heavily on the availability of accurate manual image annotations. Such annotations are very time-consuming and costly to generate, and often require specialized expertise, particularly for cross-sectional images, which contain many slices for each patient. It is therefore crucial to make the best use of annotation resources. In this paper, we systematically answer the question of how to select slices of cross-sectional medical images for annotation in order to maximize the performance of the resulting deep learning segmentation models. We conducted experiments on four medical imaging segmentation tasks with varying annotation budgets, numbers of annotated cases, numbers of annotated slices per volume, slice selection techniques, and mask interpolations. We found that: 1) Given a fixed annotation budget, it is almost always preferable to annotate fewer slices per volume and more volumes. 2) Selecting slices for annotation by unsupervised active learning (UAL) is not superior to selecting slices randomly or at fixed intervals, provided that each volume is allocated the same number of annotated slices. 3) Interpolating masks between annotated slices rarely enhances model performance, with the exception of some specific configurations for 3D models.
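As a minimal illustrative sketch (not the paper's exact protocol), the two simple slice-selection baselines named above, fixed-interval and random selection of a fixed number of slices per volume, could look like the following; the function names and the example volume size are assumptions for illustration:

```python
import numpy as np

def fixed_interval_slices(num_slices: int, budget: int) -> np.ndarray:
    """Pick `budget` slice indices spread evenly across the volume."""
    # linspace gives evenly spaced positions; rounding maps them to slice indices
    return np.unique(np.linspace(0, num_slices - 1, budget).round().astype(int))

def random_slices(num_slices: int, budget: int, seed: int = 0) -> np.ndarray:
    """Pick `budget` distinct slice indices uniformly at random."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(num_slices, size=budget, replace=False))

# Hypothetical example: a 120-slice volume with an annotation budget of 5 slices
print(fixed_interval_slices(120, 5))  # evenly spread, e.g. [0 30 60 89 119]
print(random_slices(120, 5))          # 5 distinct random slice indices
```

Under the paper's second finding, either strategy would be expected to perform comparably to UAL-based selection when each volume receives the same number of annotated slices.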
Abstract: The Segment Anything Model (SAM) has shown impressive performance in interactive and zero-shot segmentation across diverse domains, suggesting that it has learned a general concept of "objects" from its large-scale training. However, we observed that SAM struggles with certain types of objects, particularly those featuring dense, tree-like structures and low textural contrast with their surroundings. These failure modes are critical for understanding its limitations in real-world use. To examine this issue systematically, we propose metrics to quantify two key object characteristics: tree-likeness and textural separability. Through extensive controlled synthetic experiments and testing on real datasets, we demonstrate that SAM's performance is noticeably correlated with these factors. We unify these behaviors under the concept of "textural confusion", where SAM either misinterprets local structure as global texture, leading to over-segmentation, or struggles to differentiate objects from similarly textured backgrounds. These findings offer the first quantitative framework for characterizing SAM's challenges, providing valuable insights into its limitations and guiding future improvements for vision foundation models.
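The abstract does not define the proposed metrics; purely as a loose illustration, a first-order proxy for textural separability might compare the intensity distributions of an object's interior and its background. Everything in the sketch below, including the function name and the histogram-overlap formulation, is an assumption and not the paper's definition:

```python
import numpy as np

def textural_separability_proxy(image: np.ndarray, mask: np.ndarray, bins: int = 32) -> float:
    """Illustrative proxy: 1 minus the histogram overlap between pixel
    intensities inside the object mask and in the background."""
    inside = image[mask > 0].ravel()
    outside = image[mask == 0].ravel()
    lo, hi = float(image.min()), float(image.max())
    h_in, _ = np.histogram(inside, bins=bins, range=(lo, hi))
    h_out, _ = np.histogram(outside, bins=bins, range=(lo, hi))
    # Convert counts to probability mass, then measure overlap
    # (overlap of 1 means identical intensity statistics, 0 means disjoint)
    h_in = h_in / max(h_in.sum(), 1)
    h_out = h_out / max(h_out.sum(), 1)
    overlap = np.minimum(h_in, h_out).sum()
    return 1.0 - overlap  # higher = object is easier to separate from its surroundings
```

Objects scoring low on such a measure have little textural contrast with their surroundings, which is exactly the regime where the abstract reports SAM tends to fail.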
Abstract: Traffic-related injuries and fatalities are major health risks in the United States. Mobile phone use while driving quadruples the risk of a motor vehicle crash. This work demonstrates the feasibility of using the mobile phone camera to passively detect the phone user's location within a vehicle. On a large, varied dataset we were able to correctly identify whether the user was in the driver's seat or one of the passenger seats with 94.9% accuracy. This model could be used by application developers to selectively change or lock functionality while a user is driving, but not while the user is a passenger in a moving vehicle.