Abstract:Controlling greenhouse crop production systems is a complex task due to uncertain and non-linear dynamics between crops, indoor and outdoor climate, and economics. The declining number of skilled growers necessitates the development of autonomous greenhouse control systems. Reinforcement Learning (RL) is a promising approach that can learn a control policy to automate greenhouse management. RL optimises a control policy through interactions with a model of the greenhouse, guided by an economics-based reward function. However, its application to real-world systems is limited by discrepancies between models and real-world dynamics. Moreover, RL controllers may struggle to maintain state constraints while optimising the primary objective, especially when models inadequately capture the adverse effects of constraint violations on crop growth. Furthermore, generalisation to novel states, for example due to unseen weather trajectories, is underexplored in RL-based greenhouse control. This work addresses these challenges through three key contributions. First, we present GreenLight-Gym, the first open-source environment designed for training and evaluating RL algorithms on the state-of-the-art greenhouse model GreenLight. GreenLight-Gym enables the community to benchmark RL-based control methodologies. Second, we compare two reward-shaping approaches, using either a multiplicative or an additive penalty, to enforce state boundaries. The additive penalty achieves more stable training and better adherence to state constraints, whereas the multiplicative penalty yields marginally higher profits. Finally, we evaluate RL performance on disjoint training and testing weather datasets, demonstrating improved generalisation to unseen conditions. Our environment and experiment scripts are open-sourced, facilitating innovative research on learning-based greenhouse control.
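A minimal sketch of the two reward-shaping variants compared in this abstract; the violation measure, weighting coefficients, and function names are illustrative assumptions, not the exact formulation used in GreenLight-Gym:

```python
import numpy as np

def bound_violation(state, low, high):
    """Per-dimension distance outside the allowed state bounds (zero inside)."""
    return np.maximum(0.0, low - state) + np.maximum(0.0, state - high)

def additive_penalty_reward(profit, state, low, high, weight=1.0):
    """Additive shaping: subtract a penalty proportional to the violation."""
    return profit - weight * bound_violation(state, low, high).sum()

def multiplicative_penalty_reward(profit, state, low, high, scale=1.0):
    """Multiplicative shaping: scale the profit down as violations grow.

    Note: this sketch assumes a non-negative profit; a negative profit
    would also be shrunk, which weakens the penalty.
    """
    penalty = np.exp(-scale * bound_violation(state, low, high).sum())
    return profit * penalty
```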
Abstract:Learning from Demonstration offers great potential for robots to learn to perform agricultural tasks, specifically selective harvesting. One of the challenges is that the target fruit may oscillate during the approach. Grasping oscillating targets has two requirements: 1) close tracking of the target during the final approach for damage-free grasping, and 2) a complete path that is as short as possible for improved efficiency. We propose a new method called DualLQR, which applies a finite-horizon Linear Quadratic Regulator (LQR) to a moving target without the need to refit the LQR. To make this possible, we use a dual LQR setup, with one LQR running in each of two separate reference frames. Through extensive simulation testing, we found that the state-of-the-art method barely meets the required final accuracy without oscillations and drops below the required accuracy with an oscillating target. DualLQR meets the required final accuracy even under strong oscillations, with an accuracy increase of 60% for high orientation oscillations. Further testing on a real-world apple grasping task showed that DualLQR successfully grasped oscillating apples, with a success rate of 99%.
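A minimal sketch of the underlying machinery: a finite-horizon discrete-time LQR gain schedule computed once by backward Riccati recursion, applied to tracking errors expressed in two frames. The blending weight `alpha` and the assumption that both feedback terms act in a common actuation frame are illustrative, not the paper's exact rule:

```python
import numpy as np

def finite_horizon_lqr_gains(A, B, Q, R, Qf, horizon):
    """Backward Riccati recursion for a discrete-time finite-horizon LQR."""
    P = Qf
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # K_0 ... K_{H-1}

def dual_lqr_control(x_base, x_target, K_base, K_target, alpha=0.5):
    """Blend two LQR feedback laws, one per reference frame.

    x_base: tracking error expressed in the fixed (base) frame.
    x_target: tracking error expressed in the moving (target) frame.
    Assumes both control terms are expressed in the same actuation frame.
    """
    u_base = -K_base @ x_base
    u_target = -K_target @ x_target
    return alpha * u_base + (1.0 - alpha) * u_target
```

Because the gain schedule depends only on (A, B, Q, R, Qf), the target's motion changes the error inputs but not the gains, which is what removes the need to refit.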
Abstract:With the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Multi-object tracking (MOT) algorithms can be categorized into two-stage and single-stage methods. Two-stage methods tend to be simpler to adapt and implement for custom applications, while single-stage methods offer a more complex end-to-end tracking approach that can yield better results in occluded situations at the cost of more training data. The potential advantage of single-stage methods over two-stage methods depends on the complexity of the sequence of viewpoints that a robot needs to process. In this work, we compare a 3D two-stage MOT algorithm, 3D-SORT, against a 3D single-stage MOT algorithm, MOT-DETR, on three types of sequences with varying levels of complexity. The sequences represent simpler and more complex motions that a robot arm can perform in a tomato greenhouse. Our experiments in a tomato greenhouse show that the single-stage algorithm consistently yields better tracking accuracy, especially in the more challenging sequences where objects are fully occluded or out of view for several viewpoints.
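To illustrate why two-stage methods are simpler to adapt: their association step is typically a standalone matching problem over detections, as in this SORT-style sketch. The distance threshold and function names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_3d(track_centroids, det_centroids, max_dist=0.05):
    """Match existing tracks to new 3D detections (second stage of a
    two-stage tracker).

    track_centroids, det_centroids: (N, 3) and (M, 3) arrays of positions.
    Matching uses the Hungarian algorithm on pairwise Euclidean distance;
    pairs farther apart than max_dist (metres, assumed) are rejected.
    """
    cost = np.linalg.norm(
        track_centroids[:, None, :] - det_centroids[None, :, :], axis=-1
    )
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched_tracks = set(range(len(track_centroids))) - {r for r, _ in matches}
    unmatched_dets = set(range(len(det_centroids))) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets
```

A single-stage method such as MOT-DETR instead learns detection and association jointly, which is harder to retarget but degrades more gracefully under long occlusions.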
Abstract:This study presents an automated lameness detection system that uses deep-learning image processing techniques to extract multiple locomotion traits associated with lameness. Using the T-LEAP pose estimation model, the motion of nine keypoints was extracted from videos of walking cows. The videos were recorded outdoors under varying illumination conditions, and T-LEAP correctly extracted 99.6% of the keypoints. The trajectories of the keypoints were then used to compute six locomotion traits: back posture measurement, head bobbing, tracking distance, stride length, stance duration, and swing duration. The three most important traits were back posture measurement, head bobbing, and tracking distance. For the ground truth, we showed that a thoughtful merging of the observers' scores could improve intra-observer reliability and agreement. We also showed that including multiple locomotion traits improves the classification accuracy from 76.6% with only one trait to 79.9% with the three most important traits, and to 80.1% with all six locomotion traits.
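A sketch of how two of the named traits could be computed from keypoint trajectories; the exact definitions in the paper may differ, so treat these as illustrative computations:

```python
import numpy as np

def back_posture_measurement(back_keypoints):
    """Arching of the back: maximum perpendicular distance of the spine
    keypoints from the straight line joining the first and last of them.

    back_keypoints: (K, 2) array of image coordinates for one frame.
    """
    p0, p1 = back_keypoints[0], back_keypoints[-1]
    direction = (p1 - p0) / np.linalg.norm(p1 - p0)
    offsets = back_keypoints - p0
    along = offsets @ direction                      # component along the line
    perp = offsets - np.outer(along, direction)      # orthogonal component
    return float(np.linalg.norm(perp, axis=1).max())

def head_bobbing(head_y_trajectory):
    """Vertical head movement over a stride: standard deviation of the
    head keypoint's vertical coordinate across frames."""
    return float(np.std(head_y_trajectory))
```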
Abstract:Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task in which the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which can suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes as well as a sampling-based NBV planner, while requiring ten times less computation and generating 28% more efficient trajectories.
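A minimal sketch of the gradient-ascent idea behind such a planner. The paper derives the gradient analytically through differential ray sampling; here it is approximated by central finite differences, and the utility function, step size, and names are assumptions:

```python
import numpy as np

def gradient_step(view_pos, utility_fn, step=0.01, eps=1e-3):
    """One gradient-ascent step on a viewpoint-utility function.

    utility_fn maps a 3D camera position to a scalar score, e.g. the
    expected unoccluded visibility of the target node from that position.
    """
    grad = np.zeros(3)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        grad[i] = (utility_fn(view_pos + d) - utility_fn(view_pos - d)) / (2 * eps)
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return view_pos  # local optimum: no direction improves visibility
    return view_pos + step * grad / norm
```

Following the local gradient yields a continuous camera path, which is why such planners avoid the non-smooth trajectories produced by randomly sampled candidate views.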
Abstract:Given the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detecting and tracking objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we show that MOT-DETR can leverage 3D data to handle long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show that our method is resilient to the camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr
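A sketch of the 2D-3D transformer fusion idea in the spirit of MOT-DETR; layer sizes, widths, and the class name are illustrative assumptions (the actual architecture is in the linked repository):

```python
import torch
import torch.nn as nn

class FusionTrack(nn.Module):
    """Per-object image features and point-cloud features are projected to
    a shared width and fused by a small transformer encoder, producing one
    embedding per object that can be used for association over time."""

    def __init__(self, img_dim=512, pts_dim=256, d_model=256, emb_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)
        self.pts_proj = nn.Linear(pts_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.track_head = nn.Linear(d_model, emb_dim)  # re-identification embedding

    def forward(self, img_feats, pts_feats):
        # img_feats: (batch, n_objects, img_dim); pts_feats: (batch, n_objects, pts_dim)
        tokens = self.img_proj(img_feats) + self.pts_proj(pts_feats)
        fused = self.fusion(tokens)  # objects attend to each other within a view
        return self.track_head(fused)
```

Because the 3D branch contributes position-aware features, objects that look alike in 2D can still be separated, which is what helps under long occlusions and large viewpoint jumps.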
Abstract:The agro-food industry is turning to robots to address the challenge of labour shortage. However, agro-food environments pose difficulties for robots due to high variation and occlusions. In the presence of these challenges, accurate world models, with information about object location, shape, and properties, are crucial for robots to perform tasks accurately. Building such models is challenging due to the complex and unique nature of agro-food environments, and errors in the model can lead to task execution issues. In this paper, we propose MinkSORT, a novel method for generating tracking features using a 3D sparse convolutional network in a deepSORT-like approach to improve the accuracy of world models in agro-food environments. Evaluated on real-world data collected in a tomato greenhouse, our feature extractor network significantly improved the performance of our baseline model, which tracks tomato positions in 3D using a Kalman filter and Mahalanobis distance. Our deep learning feature extractor improved the HOTA from 42.8% to 44.77%, the association accuracy from 32.55% to 35.55%, and the MOTA from 57.63% to 58.81%. We also evaluated different contrastive loss functions for training our deep learning feature extractor and demonstrated that our approach leads to improved performance on three separate precision and recall detection outcomes. Our method improves world model accuracy, enabling robots to perform tasks such as harvesting and plant maintenance with greater efficiency and accuracy, which is essential for meeting the growing demand for food in a sustainable manner.
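A sketch of the deepSORT-style association cost that such a feature extractor plugs into: motion distance from the Kalman filter is combined with appearance distance between learned embeddings. The weight `lam` and the chi-square gate are conventional deepSORT-style choices, not values from the paper:

```python
import numpy as np

def combined_cost(maha_dist, track_feat, det_feat, lam=0.5, gate=9.4877):
    """deepSORT-style association cost for one track-detection pair.

    maha_dist: squared Mahalanobis distance from the Kalman filter.
    track_feat, det_feat: appearance embeddings (here, from a 3D sparse
    convolutional network). gate is the chi-square 95% threshold (4 dof).
    """
    if maha_dist > gate:
        return np.inf  # motion-implausible pairs are excluded outright
    cos_dist = 1.0 - np.dot(track_feat, det_feat) / (
        np.linalg.norm(track_feat) * np.linalg.norm(det_feat)
    )
    return lam * maha_dist / gate + (1.0 - lam) * cos_dist
```

The appearance term is what lets the tracker re-identify a tomato after an occlusion that the purely geometric Kalman/Mahalanobis baseline would lose.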
Abstract:To automate harvesting and de-leafing of tomato plants using robots, it is important to search for and detect the relevant plant parts, namely tomatoes, peduncles, and petioles. This is challenging due to high levels of occlusion in tomato greenhouses. Active vision is a promising approach that helps robots deliberately plan camera viewpoints to overcome occlusion and improve perception accuracy. However, current active-vision algorithms cannot differentiate between relevant and irrelevant plant parts, making them inefficient for targeted perception of specific plant parts. We propose a semantic active-vision strategy that uses semantic information to identify the relevant plant parts and prioritises them during view planning using an attention mechanism. We evaluated our strategy using 3D models of tomato plants with varying structural complexity, which closely represented real-world occlusions. We used a simulated environment to gain insights into our strategy while ensuring repeatability and statistical significance. At the end of ten viewpoints, our strategy correctly detected 85.5% of the plant parts, about 4 parts more on average per plant than a volumetric active-vision strategy. It also detected 5 and 9 parts more than two predefined strategies and 11 parts more than a random strategy, and it performed reliably, with a median of 88.9% correctly-detected objects per plant across 96 experiments. Our strategy was also robust to uncertainty in plant and plant-part position, plant complexity, and different viewpoint sampling strategies. We believe that our work could significantly improve the speed and robustness of automated harvesting and de-leafing in tomato crop production.
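A minimal sketch of how semantic attention can enter view planning: candidate viewpoints are scored by the attention-weighted amount of unknown space they would reveal, rather than by raw volumetric gain. The function names and scoring rule are illustrative assumptions:

```python
import numpy as np

def view_utility(unknown_mask, semantic_weight, visible_voxels):
    """Attention-weighted information gain of one candidate viewpoint.

    unknown_mask: bool array, True where the occupancy map is still unknown.
    semantic_weight: per-voxel attention, high for relevant plant parts
    (e.g. tomatoes, peduncles, petioles), low for irrelevant ones.
    visible_voxels: indices of voxels inside the view frustum and unoccluded.
    """
    gain = unknown_mask[visible_voxels] * semantic_weight[visible_voxels]
    return float(gain.sum())

def next_best_view(candidates, unknown_mask, semantic_weight, visibility_fn):
    """Pick the candidate viewpoint with the highest attention-weighted gain."""
    scores = [
        view_utility(unknown_mask, semantic_weight, visibility_fn(v))
        for v in candidates
    ]
    return candidates[int(np.argmax(scores))]
```

With uniform semantic weights this reduces to a plain volumetric strategy, which is exactly the baseline the abstract compares against.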
Abstract:Accurate representation and localization of relevant objects is important for robots to perform tasks. Building a generic representation that can be used across different environments and tasks is not easy, as the relevant objects vary depending on the environment and the task. Agro-food environments pose an additional challenge due to their complexity and high levels of clutter and occlusion. In this paper, we present a method to build generic representations in highly occluded agro-food environments using multi-view perception and 3D multi-object tracking. Our representation is built upon a detection algorithm that generates a partial point cloud for each detected object. The detected objects are then passed to a 3D multi-object tracking algorithm that creates and updates the representation over time. The whole process runs at a rate of 10 Hz. We evaluated the accuracy of the representation in a real-world agro-food environment, where it successfully represented and located tomatoes on tomato plants despite a high level of occlusion. We estimated the total count of tomatoes with a maximum error of 5.08% and tracked tomatoes with a tracking accuracy of up to 71.47%. Additionally, we showed that an evaluation using tracking metrics gives more insight into the errors in localizing and representing the fruits.
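A sketch of how a tracking-based representation can yield a stable object count: only tracks confirmed over several viewpoints are counted, suppressing spurious detections. The track structure and confirmation threshold are illustrative assumptions:

```python
class Track:
    """Minimal track record: a 3D position plus an association hit count."""

    def __init__(self, position):
        self.position = position
        self.hits = 1  # incremented each time a detection is associated

def count_objects(tracks, min_hits=3):
    """Estimate the object count as the number of confirmed tracks, i.e.
    tracks associated with detections in at least min_hits viewpoints."""
    return sum(1 for t in tracks if t.hits >= min_hits)
```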
Abstract:Visual reconstruction of tomato plants by a robot is extremely challenging due to the high levels of variation and occlusion in greenhouse environments. The paradigm of active vision helps overcome these challenges by reasoning about previously acquired information and systematically planning camera viewpoints to gather novel information about the plant. However, existing active-vision algorithms cannot perform well on targeted perception objectives, such as the 3D reconstruction of leaf nodes, because they do not distinguish between the plant parts that need to be reconstructed and the rest of the plant. In this paper, we propose an attention-driven active-vision algorithm that considers only the plant parts relevant to the task at hand. The proposed approach was evaluated in a simulated environment on the task of 3D reconstruction of tomato plants at varying levels of attention, namely the whole plant, the main stem, and the leaf nodes. Compared to predefined and random approaches, our approach improves the accuracy of 3D reconstruction within the first 3 viewpoints by 9.7% and 5.3% for the whole plant, 14.2% and 7.9% for the main stem, and 25.9% and 17.3% for the leaf nodes, respectively. Also, compared to predefined and random approaches, our approach reconstructs 80% of the whole plant and the main stem in one fewer viewpoint, and 80% of the leaf nodes in three fewer viewpoints. We also demonstrated that the attention-driven NBV planner works effectively despite changes to the plant models, the amount of occlusion, the number of candidate viewpoints, and the resolution of reconstruction. By adding an attention mechanism to active vision, it is possible to efficiently reconstruct the whole plant and targeted plant parts. We conclude that an attention mechanism for active vision is necessary to significantly improve the quality of perception in complex agro-food environments.