Abstract:Flexible electronic skins that simultaneously sense touch and bend are desired in several application areas, such as to cover articulated robot structures. This paper introduces a flexible tactile sensor based on Electrical Impedance Tomography (EIT), capable of simultaneously detecting and measuring contact forces and flexion of the sensor. The sensor integrates a magnetic hydrogel composite and utilizes EIT to reconstruct internal conductivity distributions. Real-time estimation is achieved through the one-step Gauss-Newton method, which dynamically updates reference voltages to accommodate sensor deformation. A convolutional neural network is employed to classify interactions, distinguishing between touch, bending, and idle states using pre-reconstructed images. Experimental results demonstrate an average touch localization error of 5.4 mm (SD 2.2 mm) and average bending angle estimation errors of 1.9$^\circ$ (SD 1.6$^\circ$). The proposed adaptive reference method effectively distinguishes between single- and multi-touch scenarios while compensating for deformation effects. This makes the sensor a promising solution for multimodal sensing in robotics and human-robot collaboration.
Abstract:This paper presents a dual-channel tactile skin that integrates Electrical Impedance Tomography (EIT) with air pressure sensing to achieve accurate multi-contact force detection. The EIT layer provides spatial contact information, while the air pressure sensor delivers precise total force measurement. Our framework combines these complementary modalities through: deep learning-based EIT image reconstruction, contact area segmentation, and force allocation based on relative conductivity intensities from EIT. The experiments demonstrated 15.1% average force estimation error in single-contact scenarios and 20.1% in multi-contact scenarios without extensive calibration data requirements. This approach effectively addresses the challenge of simultaneously localizing and quantifying multiple contact forces without requiring complex external calibration setups, paving the way for practical and scalable soft robotic skin applications.
Abstract:Current embodied reasoning agents struggle to plan for long-horizon tasks that require to physically interact with the world to obtain the necessary information (e.g. 'sort the objects from lightest to heaviest'). The improvement of the capabilities of such an agent is highly dependent on the availability of relevant training environments. In order to facilitate the development of such systems, we introduce a novel simulation environment (built on top of robosuite) that makes use of the MuJoCo physics engine and high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. It is the first simulator focusing on long-horizon robot manipulation tasks preserving accurate physics modeling. MuBlE can generate mutlimodal data for training and enable design of closed-loop methods through environment interaction on two levels: visual - action loop, and control - physics loop. Together with the simulator, we propose SHOP-VRB2, a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements.
Abstract:Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision, which rely on large datasets and high computational resources. Biological selective attention mechanisms allow agents to focus on salient Regions of Interest (ROIs), reducing computational demand while maintaining real-time responsiveness. Event-based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes enabling efficient low-latency processing. To distinguish moving objects while the event-based camera is in motion the agent requires an object motion segmentation mechanism to accurately detect targets and center them in the visual field (fovea). Integrating event-based sensors with neuromorphic algorithms represents a paradigm shift, using Spiking Neural Networks to parallelize computation and adapt to dynamic environments. This work presents a Spiking Convolutional Neural Network bioinspired attention system for selective attention through object motion sensitivity. The system generates events via fixational eye movements using a Dynamic Vision Sensor integrated into the Speck neuromorphic hardware, mounted on a Pan-Tilt unit, to identify the ROI and saccade toward it. The system, characterized using ideal gratings and benchmarked against the Event Camera Motion Segmentation Dataset, reaches a mean IoU of 82.2% and a mean SSIM of 96% in multi-object motion segmentation. The detection of salient objects reaches 88.8% accuracy in office scenarios and 89.8% in low-light conditions on the Event-Assisted Low-Light Video Object Segmentation Dataset. A real-time demonstrator shows the system's 0.12 s response to dynamic scenes. Its learning-free design ensures robustness across perceptual scenes, making it a reliable foundation for real-time robotic applications serving as a basis for more complex architectures.
Abstract:Depth sensing is an essential technology in robotics and many other fields. Many depth sensing (or RGB-D) cameras are available on the market and selecting the best one for your application can be challenging. In this work, we tested four stereoscopic RGB-D cameras that sense the distance by using two images from slightly different views. We empirically compared four cameras (Intel RealSense D435, Intel RealSense D455, StereoLabs ZED 2, and Luxonis OAK-D Pro) in three scenarios: (i) planar surface perception, (ii) plastic doll perception, (iii) household object perception (YCB dataset). We recorded and evaluated more than 3,000 RGB-D frames for each camera. For table-top robotics scenarios with distance to objects up to one meter, the best performance is provided by the D435 camera. For longer distances, the other three models perform better, making them more suitable for some mobile robotics applications. OAK-D Pro additionally offers integrated AI modules (e.g., object and human keypoint detection). ZED 2 is not a standalone device and requires a computer with a GPU for depth data acquisition. All data (more than 12,000 RGB-D frames) are publicly available at https://osf.io/f2seb.
Abstract:Manipulability analysis is a methodology employed to assess the capacity of an articulated system, at a specific configuration, to produce motion or exert force in diverse directions. The conventional method entails generating a virtual ellipsoid using the system's configuration and model. Yet, this approach poses challenges when applied to systems such as the human body, where direct access to such information is limited, necessitating reliance on estimations. Any inaccuracies in these estimations can distort the ellipsoid's configuration, potentially compromising the accuracy of the manipulability assessment. To address this issue, this article extends the standard approach by introducing the concept of the manipulability pseudo-ellipsoid. Through a series of theoretical analyses, simulations, and experiments, the article demonstrates that the proposed method exhibits reduced sensitivity to noise in sensory information, consequently enhancing the robustness of the approach.
Abstract:What is considered safe for a robot operator during physical human-robot collaboration (HRC) is specified in corresponding HRC standards (e.g., the European ISO/TS 15066). The regime that allows collisions between the moving robot and the operator, called Power and Force Limiting (PFL), restricts the permissible contact forces. Using the same fixed contact thresholds on the entire robot surface results in significant and unnecessary productivity losses, as the robot needs to stop even when impact forces are within limits. Here we present a framework for setting the protective skin thresholds individually for different parts of the robot body and dynamically on the fly, based on the effective mass of each robot link and the link velocity. We perform experiments on a 6-axis collaborative robot arm (UR10e) completely covered with a sensitive skin (AIRSKIN) consisting of eleven individual pads. On a mock pick-and-place scenario with both transient and quasi-static collisions, we demonstrate how skin sensitivity influences the task performance and exerted force. We show an increase in productivity of almost 50% from the most conservative setting of collision thresholds to the most adaptive setting, while ensuring safety for human operators. The method is applicable to any robot for which the effective mass can be calculated.
Abstract:Artificial electronic skins covering complete robot bodies can make physical human-robot collaboration safe and hence possible. Standards for collaborative robots (e.g., ISO/TS 15066) prescribe permissible forces and pressures during contacts with the human body. These characteristics of the collision depend on the speed of the colliding robot link but also on its effective mass. Thus, to warrant contacts complying with the Power and Force Limiting (PFL) collaborative regime but at the same time maximizing productivity, protective skin thresholds should be set individually for different parts of the robot bodies and dynamically on the run. Here we present and empirically evaluate four scenarios: (a) static and uniform - fixed thresholds for the whole skin, (b) static but different settings for robot body parts, (c) dynamically set based on every link velocity, (d) dynamically set based on effective mass of every robot link. We perform experiments in simulation and on a real 6-axis collaborative robot arm (UR10e) completely covered with sensitive skin (AIRSKIN) comprising eleven individual pads. On a mock pick-and-place scenario with transient collisions with the robot body parts and two collision reactions (stop and avoid), we demonstrate the boost in productivity in going from the most conservative setting of the skin thresholds (a) to the most adaptive setting (d). The threshold settings for every skin pad are adapted with a frequency of 25 Hz. This work can be easily extended for platforms with more degrees of freedom and larger skin coverage (humanoids) and to social human-robot interaction scenarios where contacts with the robot will be used for communication.
Abstract:Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies "in the wild", facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets featuring adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (object keypoint similarity, average precision and recall), we introduce errors expressed in the neck-mid-hip ratio and additionally study missed and redundant detections and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
Abstract:Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting and changing the scene (e.g. 'Sort the objects from lightest to heaviest'). In order to facilitate the development of such systems we introduce a new simulating environment that makes use of MuJoCo physics engine and high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances as well as uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in the real world manipulation tasks with a success rate above 76% and 64%, respectively.