Abstract:Reconstructing the 3D shape of a deformable environment from the information captured by a moving depth camera is highly relevant to surgery. The underlying challenge is the fact that simultaneously estimating camera motion and tissue deformation in a fully deformable scene is an ill-posed problem, especially from a single arbitrarily moving viewpoint. Current solutions are often organ-specific and lack the robustness required to handle large deformations. Here we propose a multi-viewpoint global optimization framework that can flexibly integrate the output of low-level perception modules (data association, depth, and relative scene flow) with kinematic and scene-modeling priors to jointly estimate multiple camera motions and absolute scene flow. We use simulated noisy data to show three practical examples that successfully constrain the convergence to a unique solution. Overall, our method shows robustness to combined noisy input measures and can process hundreds of points in a few milliseconds. MultiViPerFrOG builds a generalized learning-free scaffolding for spatio-temporal encoding that can unlock advanced surgical scene representations and will facilitate the development of the computer-assisted-surgery technologies of the future.
Abstract:New sensing technologies and more advanced processing algorithms are transforming computer-integrated surgery. While researchers are actively investigating depth sensing and 3D reconstruction for vision-based surgical assistance, it remains difficult to achieve real-time, accurate, and robust 3D representations of the abdominal cavity for minimally invasive surgery. Thus, this work uses quantitative testing on fresh ex-vivo porcine tissue to thoroughly characterize the quality with which a 3D laser-based time-of-flight sensor (lidar) can perform anatomical surface reconstruction. Ground-truth surface shapes are captured with a commercial laser scanner, and the resulting signed error fields are analyzed using rigorous statistical tools. When compared to modern learning-based stereo matching from endoscopic images, time-of-flight sensing demonstrates higher precision, lower processing delay, higher frame rate, and superior robustness against sensor distance and poor illumination. Furthermore, we report on the potential negative effect of near-infrared light penetration on the accuracy of lidar measurements across different tissue samples, identifying a significant measured depth offset for muscle in contrast to fat and liver. Our findings highlight the potential of lidar for intraoperative 3D perception and point toward new methods that combine complementary time-of-flight and spectral imaging.
Abstract:Intelligent interaction with the physical world requires perceptual abilities beyond vision and hearing; vibrant tactile sensing is essential for autonomous robots to dexterously manipulate unfamiliar objects or safely contact humans. Therefore, robotic manipulators need high-resolution touch sensors that are compact, robust, inexpensive, and efficient. The soft vision-based haptic sensor presented herein is a miniaturized and optimized version of the previously published sensor Insight. Minsight has the size and shape of a human fingertip and uses machine learning methods to output high-resolution maps of 3D contact force vectors at 60 Hz. Experiments confirm its excellent sensing performance, with a mean absolute force error of 0.07 N and contact location error of 0.6 mm across its surface area. Minsight's utility is shown in two robotic tasks on a 3-DoF manipulator. First, closed-loop force control enables the robot to track the movements of a human finger based only on tactile data. Second, the informative value of the sensor output is shown by detecting whether a hard lump is embedded within a soft elastomer with an accuracy of 98%. These findings indicate that Minsight can give robots the detailed fingertip touch sensing needed for dexterous manipulation and physical human-robot interaction.
Abstract:Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at $\href{http://sgnify.is.tue.mpg.de}{\text{sgnify.is.tue.mpg.de}}$.
Abstract:Machine learning and deep learning have been used extensively to classify physical surfaces through images and time-series contact data. However, these methods rely on human expertise and entail the time-consuming processes of data and parameter tuning. To overcome these challenges, we propose an easily implemented framework that can directly handle heterogeneous data sources for classification tasks. Our data-versus-data approach automatically quantifies distinctive differences in distributions in a high-dimensional space via kernel two-sample testing between two sets extracted from multimodal data (e.g., images, sounds, haptic signals). We demonstrate the effectiveness of our technique by benchmarking against expertly engineered classifiers for visual-audio-haptic surface recognition due to the industrial relevance, difficulty, and competitive baselines of this application; ablation studies confirm the utility of key components of our pipeline. As shown in our open-source code, we achieve 97.2% accuracy on a standard multi-user dataset with 108 surface classes, outperforming the state-of-the-art machine-learning algorithm by 6% on a more difficult version of the task. The fact that our classifier obtains this performance with minimal data processing in the standard algorithm setting reinforces the powerful nature of kernel methods for learning to recognize complex patterns.
Abstract:Individuals who use myoelectric upper-limb prostheses often rely heavily on vision to complete their daily activities. They thus struggle in situations where vision is overloaded, such as multitasking, or unavailable, such as poor lighting conditions. Non-amputees can easily accomplish such tasks due to tactile reflexes and haptic sensation guiding their upper-limb motor coordination. Based on these principles, we developed and tested two novel prosthesis systems that incorporate autonomous controllers and provide the user with touch-location feedback through either vibration or distributed pressure. These capabilities were made possible by installing a custom contact-location sensor on thefingers of a commercial prosthetic hand, along with a custom pressure sensor on the thumb. We compared the performance of the two systems against a standard myoelectric prosthesis and a myoelectric prosthesis with only autonomous controllers in a difficult reach-to-pick-and-place task conducted without direct vision. Results from 40 non-amputee participants in this between-subjects study indicated that vibrotactile feedback combined with synthetic reflexes proved significantly more advantageous than the standard prosthesis in several of the task milestones. In addition, vibrotactile feedback and synthetic reflexes improved grasp placement compared to only synthetic reflexes or pressure feedback combined with synthetic reflexes. These results indicate that both autonomous controllers and haptic feedback facilitate success in dexterous tasks without vision, and that the type of haptic display matters.
Abstract:Hugs are complex affective interactions that often include gestures like squeezes. We present six new guidelines for designing interactive hugging robots, which we validate through two studies with our custom robot. To achieve autonomy, we investigated robot responses to four human intra-hug gestures: holding, rubbing, patting, and squeezing. Thirty-two users each exchanged and rated sixteen hugs with an experimenter-controlled HuggieBot 2.0. The robot's inflated torso's microphone and pressure sensor collected data of the subjects' demonstrations that were used to develop a perceptual algorithm that classifies user actions with 88\% accuracy. Users enjoyed robot squeezes, regardless of their performed action, they valued variety in the robot response, and they appreciated robot-initiated intra-hug gestures. From average user ratings, we created a probabilistic behavior algorithm that chooses robot responses in real time. We implemented improvements to the robot platform to create HuggieBot 3.0 and then validated its gesture perception system and behavior algorithm with sixteen users. The robot's responses and proactive gestures were greatly enjoyed. Users found the robot more natural, enjoyable, and intelligent in the last phase of the experiment than in the first. After the study, they felt more understood by the robot and thought robots were nicer to hug.
Abstract:Vision-based haptic sensors have emerged as a promising approach to robotic touch due to affordable high-resolution cameras and successful computer-vision techniques. However, their physical design and the information they provide do not yet meet the requirements of real applications. We present a robust, soft, low-cost, vision-based, thumb-sized 3D haptic sensor named Insight: it continually provides a directional force-distribution map over its entire conical sensing surface. Constructed around an internal monocular camera, the sensor has only a single layer of elastomer over-molded on a stiff frame to guarantee sensitivity, robustness, and soft contact. Furthermore, Insight is the first system to combine photometric stereo and structured light using a collimator to detect the 3D deformation of its easily replaceable flexible outer shell. The force information is inferred by a deep neural network that maps images to the spatial distribution of 3D contact force (normal and shear). Insight has an overall spatial resolution of 0.4 mm, force magnitude accuracy around 0.03 N, and force direction accuracy around 5 degrees over a range of 0.03--2 N for numerous distinct contacts with varying contact area. The presented hardware and software design concepts can be transferred to a wide variety of robot parts.
Abstract:The lack of haptically aware upper-limb prostheses forces amputees to rely largely on visual cues to complete activities of daily living. In contrast, able-bodied individuals inherently rely on conscious haptic perception and automatic tactile reflexes to govern volitional actions in situations that do not allow for constant visual attention. We therefore propose a myoelectric prosthesis system that reflects these concepts to aid manipulation performance without direct vision. To implement this design, we built two fabric-based tactile sensors that measure contact location along the palmar and dorsal sides of the prosthetic fingers and grasp pressure at the tip of the prosthetic thumb. Inspired by the natural sensorimotor system, we use the measurements from these sensors to provide vibrotactile feedback of contact location and implement a tactile grasp controller that uses automatic reflexes to prevent over-grasping and object slip. We compare this system to a standard myoelectric prosthesis in a challenging reach-to-pick-and-place task conducted without direct vision; 17 able-bodied adults took part in this single-session between-subjects study. Participants in the tactile group achieved more consistent high performance compared to participants in the standard group. These results indicate that the addition of contact-location feedback and reflex control increases the consistency with which objects can be grasped and moved without direct vision in upper-limb prosthetics.
Abstract:Receiving a hug is one of the best ways to feel socially supported, and the lack of social touch can have severe negative effects on an individual's well-being. Based on previous research both within and outside of HRI, we propose six tenets ("commandments") of natural and enjoyable robotic hugging: a hugging robot should be soft, be warm, be human sized, visually perceive its user, adjust its embrace to the user's size and position, and reliably release when the user wants to end the hug. Prior work validated the first two tenets, and the final four are new. We followed all six tenets to create a new robotic platform, HuggieBot 2.0, that has a soft, warm, inflated body (HuggieChest) and uses visual and haptic sensing to deliver closed-loop hugging. We first verified the outward appeal of this platform in comparison to the previous PR2-based HuggieBot 1.0 via an online video-watching study involving 117 users. We then conducted an in-person experiment in which 32 users each exchanged eight hugs with HuggieBot 2.0, experiencing all combinations of visual hug initiation, haptic sizing, and haptic releasing. The results show that adding haptic reactivity definitively improves user perception a hugging robot, largely verifying our four new tenets and illuminating several interesting opportunities for further improvement.