Abstract:In recent years, powerful data-driven deep-learning techniques have been developed and applied for automated catch registration. However, these methods are dependent on the labelled data, which is time-consuming, labour-intensive, expensive to collect and need expert knowledge. In this study, we present an active learning technique, named BoxAL, which includes estimation of epistemic certainty of the Faster R-CNN object-detection model. The method allows selecting the most uncertain training images from an unlabeled pool, which are then used to train the object-detection model. To evaluate the method, we used an open-source image dataset obtained with a dedicated image-acquisition system developed for commercial trawlers targeting demersal species. We demonstrated, that our approach allows reaching the same object-detection performance as with the random sampling using 400 fewer labelled images. Besides, mean AP score was significantly higher at the last training iteration with 1100 training images, specifically, 39.0±1.6 and 34.8±1.8 for certainty-based sampling and random sampling, respectively. Additionally, we showed that epistemic certainty is a suitable method to sample images that the current iteration of the model cannot deal with yet. Our study additionally showed that the sampled new data is more valuable for training than the remaining unlabeled data. Our software is available on https://github.com/pieterblok/boxal.
Abstract:Learning from Demonstration offers great potential for robots to learn to perform agricultural tasks, specifically selective harvesting. One of the challenges is that the target fruit can be oscillating while approaching. Grasping oscillating targets has two requirements: 1) close tracking of the target during the final approach for damage-free grasping, and 2) the complete path should be as short as possible for improved efficiency. We propose a new method called DualLQR. In this method, we use a finite horizon Linear Quadratic Regulator (LQR) on a moving target, without the need of refitting the LQR. To make this possible, we use a dual LQR setup, with an LQR running in two seperate reference frames. Through extensive simulation testing, it was found that the state-of-art method barely meets the required final accuracy without oscillations and drops below the required accuracy with an oscillating target. DualLQR was found to be able to meet the required final accuracy even with high oscillations, with an accuracy increase of 60% for high orientation oscillations. Further testing on a real-world apple grasping task showed that DualLQR was able to successfully grasp oscillating apples, with a success rate of 99%.
Abstract:With the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Multi-object tracking (MOT) algorithms can be categorized between two-stage and single-stage methods. Two-stage methods tend to be simpler to adapt and implement to custom applications, while single-stage methods present a more complex end-to-end tracking method that can yield better results in occluded situations at the cost of more training data. The potential advantages of single-stage methods over two-stage methods depends on the complexity of the sequence of viewpoints that a robot needs to process. In this work, we compare a 3D two-stage MOT algorithm, 3D-SORT, against a 3D single-stage MOT algorithm, MOT-DETR, in three different types of sequences with varying levels of complexity. The sequences represent simpler and more complex motions that a robot arm can perform in a tomato greenhouse. Our experiments in a tomato greenhouse show that the single-stage algorithm consistently yields better tracking accuracy, especially in the more challenging sequences where objects are fully occluded or non-visible during several viewpoints.
Abstract:This study presents an automated lameness detection system that uses deep-learning image processing techniques to extract multiple locomotion traits associated with lameness. Using the T-LEAP pose estimation model, the motion of nine keypoints was extracted from videos of walking cows. The videos were recorded outdoors, with varying illumination conditions, and T-LEAP extracted 99.6% of correct keypoints. The trajectories of the keypoints were then used to compute six locomotion traits: back posture measurement, head bobbing, tracking distance, stride length, stance duration, and swing duration. The three most important traits were back posture measurement, head bobbing, and tracking distance. For the ground truth, we showed that a thoughtful merging of the scores of the observers could improve intra-observer reliability and agreement. We showed that including multiple locomotion traits improves the classification accuracy from 76.6% with only one trait to 79.9% with the three most important traits and to 80.1% with all six locomotion traits.
Abstract:Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task where the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which could suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes equally well as a sampling-based NBV planner, while taking ten times less computation and generating 28% more efficient trajectories.
Abstract:In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr
Abstract:Greenhouse production of fruits and vegetables in developed countries is challenged by labor 12 scarcity and high labor costs. Robots offer a good solution for sustainable and cost-effective 13 production. Acquiring accurate spatial information about relevant plant parts is vital for 14 successful robot operation. Robot perception in greenhouses is challenging due to variations in 15 plant appearance, viewpoints, and illumination. This paper proposes a keypoint-detection-based 16 method using data from an RGB-D camera to estimate the 3D pose of peduncle nodes, which 17 provides essential information to harvest the tomato bunches. 18 19 Specifically, this paper proposes a method that detects four anatomical landmarks in the color 20 image and then integrates 3D point-cloud information to determine the 3D pose. A 21 comprehensive evaluation was conducted in a commercial greenhouse to gain insight into the 22 performance of different parts of the method. The results showed: (1) high accuracy in object 23 detection, achieving an Average Precision (AP) of AP@0.5=0.96; (2) an average Percentage of 24 Detected Joints (PDJ) of the keypoints of PhDJ@0.2=94.31%; and (3) 3D pose estimation 25 accuracy with mean absolute errors (MAE) of 11.38o and 9.93o for the relative upper and lower 26 angles between the peduncle and main stem, respectively. Furthermore, the capability to handle 27 variations in viewpoint was investigated, demonstrating the method was robust to view changes. 28 However, canonical and higher views resulted in slightly higher performance compared to other 29 views. Although tomato was selected as a use case, the proposed method is also applicable to 30 other greenhouse crops like pepper.
Abstract:The agro-food industry is turning to robots to address the challenge of labour shortage. However, agro-food environments pose difficulties for robots due to high variation and occlusions. In the presence of these challenges, accurate world models, with information about object location, shape, and properties, are crucial for robots to perform tasks accurately. Building such models is challenging due to the complex and unique nature of agro-food environments, and errors in the model can lead to task execution issues. In this paper, we propose MinkSORT, a novel method for generating tracking features using a 3D sparse convolutional network in a deepSORT-like approach to improve the accuracy of world models in agro-food environments. We evaluated our feature extractor network using real-world data collected in a tomato greenhouse, which significantly improved the performance of our baseline model that tracks tomato positions in 3D using a Kalman filter and Mahalanobis distance. Our deep learning feature extractor improved the HOTA from 42.8% to 44.77%, the association accuracy from 32.55% to 35.55%, and the MOTA from 57.63% to 58.81%. We also evaluated different contrastive loss functions for training our deep learning feature extractor and demonstrated that our approach leads to improved performance in terms of three separate precision and recall detection outcomes. Our method improves world model accuracy, enabling robots to perform tasks such as harvesting and plant maintenance with greater efficiency and accuracy, which is essential for meeting the growing demand for food in a sustainable manner.
Abstract:To automate harvesting and de-leafing of tomato plants using robots, it is important to search and detect the relevant plant parts, namely tomatoes, peduncles, and petioles. This is challenging due to high levels of occlusion in tomato greenhouses. Active vision is a promising approach which helps robots to deliberately plan camera viewpoints to overcome occlusion and improve perception accuracy. However, current active-vision algorithms cannot differentiate between relevant and irrelevant plant parts, making them inefficient for targeted perception of specific plant parts. We propose a semantic active-vision strategy that uses semantic information to identify the relevant plant parts and prioritises them during view planning using an attention mechanism. We evaluated our strategy using 3D models of tomato plants with varying structural complexity, which closely represented occlusions in the real world. We used a simulated environment to gain insights into our strategy, while ensuring repeatability and statistical significance. At the end of ten viewpoints, our strategy was able to correctly detect 85.5% of the plant parts, about 4 parts more on average per plant compared to a volumetric active-vision strategy. Also, it detected 5 and 9 parts more compared to two predefined strategies and 11 parts more compared to a random strategy. It also performed reliably with a median of 88.9% correctly-detected objects per plant in 96 experiments. Our strategy was also robust to uncertainty in plant and plant-part position, plant complexity, and different viewpoint sampling strategies. We believe that our work could significantly improve the speed and robustness of automated harvesting and de-leafing in tomato crop production.
Abstract:Accurate representation and localization of relevant objects is important for robots to perform tasks. Building a generic representation that can be used across different environments and tasks is not easy, as the relevant objects vary depending on the environment and the task. Furthermore, another challenge arises in agro-food environments due to their complexity, and high levels of clutter and occlusions. In this paper, we present a method to build generic representations in highly occluded agro-food environments using multi-view perception and 3D multi-object tracking. Our representation is built upon a detection algorithm that generates a partial point cloud for each detected object. The detected objects are then passed to a 3D multi-object tracking algorithm that creates and updates the representation over time. The whole process is performed at a rate of 10 Hz. We evaluated the accuracy of the representation on a real-world agro-food environment, where it was able to successfully represent and locate tomatoes in tomato plants despite a high level of occlusion. We were able to estimate the total count of tomatoes with a maximum error of 5.08% and to track tomatoes with a tracking accuracy up to 71.47%. Additionally, we showed that an evaluation using tracking metrics gives more insight in the errors in localizing and representing the fruits.