Abstract:We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained using a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment, report accuracy for zero-shot learning, and evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools ranging from pest control strategies, crop monitoring, and worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. Please see the \href{https://baskargroup.github.io/Arboretum/}{project website} for links to our data, models, and code.
Abstract:Insect-pests significantly impact global agricultural productivity and quality. Effective management involves identifying the full insect community, including beneficial insects and harmful pests, to develop and implement integrated pest management strategies. Automated identification of insects under real-world conditions presents several challenges, including differentiating similar-looking species, intra-species dissimilarity and inter-species similarity, several life cycle stages, camouflage, diverse imaging conditions, and variability in insect orientation. A deep-learning model, InsectNet, is proposed to address these challenges. InsectNet is endowed with five key features: (a) utilization of a large dataset of insect images collected through citizen science; (b) label-free self-supervised learning for large models; (c) improving prediction accuracy for species with a small sample size; (d) enhancing model trustworthiness; and (e) democratizing access through streamlined MLOps. This approach allows accurate identification (>96% accuracy) of over 2500 insect species, including pollinator (e.g., butterflies, bees), parasitoid (e.g., some wasps and flies), predator species (e.g., lady beetles, mantises, dragonflies) and harmful pest species (e.g., armyworms, cutworms, grasshoppers, stink bugs). InsectNet can identify invasive species, provide fine-grained insect species identification, and work effectively in challenging backgrounds. It also can abstain from making predictions when uncertain, facilitating seamless human intervention and making it a practical and trustworthy tool. InsectNet can guide citizen science data collection, especially for invasive species where early detection is crucial. Similar approaches may transform other agricultural challenges like disease detection and underscore the importance of data collection, particularly through citizen science efforts..
Abstract:Deep learning-based approaches have produced models with good insect classification accuracy; Most of these models are conducive for application in controlled environmental conditions. One of the primary emphasis of researchers is to implement identification and classification models in the real agriculture fields, which is challenging because input images that are wildly out of the distribution (e.g., images like vehicles, animals, humans, or a blurred image of an insect or insect class that is not yet trained on) can produce an incorrect insect classification. Out-of-distribution (OOD) detection algorithms provide an exciting avenue to overcome these challenge as it ensures that a model abstains from making incorrect classification prediction of non-insect and/or untrained insect class images. We generate and evaluate the performance of state-of-the-art OOD algorithms on insect detection classifiers. These algorithms represent a diversity of methods for addressing an OOD problem. Specifically, we focus on extrusive algorithms, i.e., algorithms that wrap around a well-trained classifier without the need for additional co-training. We compared three OOD detection algorithms: (i) Maximum Softmax Probability, which uses the softmax value as a confidence score, (ii) Mahalanobis distance-based algorithm, which uses a generative classification approach; and (iii) Energy-Based algorithm that maps the input data to a scalar value, called energy. We performed an extensive series of evaluations of these OOD algorithms across three performance axes: (a) \textit{Base model accuracy}: How does the accuracy of the classifier impact OOD performance? (b) How does the \textit{level of dissimilarity to the domain} impact OOD performance? and (c) \textit{Data imbalance}: How sensitive is OOD performance to the imbalance in per-class sample size?