Abstract:We propose a generative model for the spatio-temporal distribution of high dimensional categorical observations. These are commonly produced by robots equipped with an imaging sensor such as a camera, paired with an image classifier, potentially producing observations over thousands of categories. The proposed approach combines the use of Dirichlet distributions to model sparse co-occurrence relations between the observed categories using a latent variable, and Gaussian processes to model the latent variable's spatio-temporal distribution. Experiments in this paper show that the resulting model is able to efficiently and accurately approximate the temporal distribution of high dimensional categorical measurements such as taxonomic observations of microscopic organisms in the ocean, even in unobserved (held out) locations, far from other samples. This work's primary motivation is to enable deployment of informative path planning techniques over high dimensional categorical fields, which until now have been limited to scalar or low dimensional vector observations.
Abstract:Many interesting natural phenomena are sparsely distributed and discrete. Locating the hotspots of such sparsely distributed phenomena is often difficult because their density gradient is likely to be very noisy. We present a novel approach to this search problem, where we model the co-occurrence relations between a robot's observations with a Bayesian nonparametric topic model. This approach makes it possible to produce a robust estimate of the spatial distribution of the target, even in the absence of direct target observations. We apply the proposed approach to the problem of finding the spatial locations of the hotspots of a specific phytoplankton taxon in the ocean. We use classified image data from Imaging FlowCytobot (IFCB), which automatically measures individual microscopic cells and colonies of cells. Given these individual taxon-specific observations, we learn a phytoplankton community model that characterizes the co-occurrence relations between taxa. We present experiments with simulated robot missions drawn from real observation data collected during a research cruise traversing the US Atlantic coast. Our results show that the proposed approach outperforms nearest neighbor and k-means based methods for predicting the spatial distribution of hotspots from in-situ observations.
Abstract:Planktonic organisms are of fundamental importance to marine ecosystems: they form the basis of the food web, provide the link between the atmosphere and the deep ocean, and influence global-scale biogeochemical cycles. Scientists are increasingly using imaging-based technologies to study these creatures in their natural habit. Images from such systems provide an unique opportunity to model and understand plankton ecosystems, but the collected datasets can be enormous. The Imaging FlowCytobot (IFCB) at Woods Hole Oceanographic Institution, for example, is an \emph{in situ} system that has been continuously imaging plankton since 2006. To date, it has generated more than 700 million samples. Manual classification of such a vast image collection is impractical due to the size of the data set. In addition, the annotation task is challenging due to the large space of relevant classes, intra-class variability, and inter-class similarity. Methods for automated classification exist, but the accuracy is often below that of human experts. Here we introduce WHOI-Plankton: a large scale, fine-grained visual recognition dataset for plankton classification, which comprises over 3.4 million expert-labeled images across 70 classes. The labeled image set is complied from over 8 years of near continuous data collection with the IFCB at the Martha's Vineyard Coastal Observatory (MVCO). We discuss relevant metrics for evaluation of classification performance and provide results for a traditional method based on hand-engineered features and two methods based on convolutional neural networks.