Abstract:Camera traps are a wildlife monitoring strategy that collects large numbers of pictures. The number of images collected per species usually follows a long-tail distribution, i.e., a few classes have a large number of instances while many species have only a few. Although these rare species are often the classes of interest to ecologists, they tend to be neglected by deep learning models because such models require a large number of images for training. In this work, we systematically evaluate recently proposed techniques, namely square-root re-sampling, class-balanced focal loss, and balanced group softmax, to address long-tail visual recognition of animal species in camera trap images. To reach a more general conclusion, we evaluated the selected methods on four families of computer vision models (ResNet, MobileNetV3, EfficientNetV2, and Swin Transformer) and four camera trap datasets with different characteristics. We first prepared a strong baseline using recent training tricks and then applied the methods for improving long-tail recognition. Our experiments show that the Swin Transformer can reach high performance on rare classes without any additional method for handling imbalance, with an overall accuracy of 88.76% on the WCS dataset and 94.97% on Snapshot Serengeti, considering a location-based train/test split. In general, square-root sampling improved minority-class performance the most, by around 10%, but at the cost of reducing majority-class accuracy by at least 4%. These results motivated us to propose a simple and effective approach: an ensemble combining the square-root-sampled model and the baseline. The proposed approach achieved the best trade-off between tail-class performance and the cost in head-class accuracy.
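As an illustration of the re-sampling strategy evaluated above, here is a minimal sketch of square-root re-sampling in PyTorch, assuming integer class labels for the training set; the helper name make_sqrt_sampler is illustrative, not from the paper.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_sqrt_sampler(labels):
    """Build a sampler whose per-class sampling mass follows sqrt(n_c)
    instead of the raw class frequency n_c."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    # Each instance of class c is drawn with weight sqrt(n_c) / n_c,
    # so the total mass of class c is n_c * sqrt(n_c) / n_c = sqrt(n_c).
    instance_weights = np.sqrt(class_counts)[labels] / class_counts[labels]
    return WeightedRandomSampler(
        weights=torch.as_tensor(instance_weights, dtype=torch.double),
        num_samples=len(labels),
        replacement=True,
    )
```

The ensemble mentioned at the end of the abstract could then be approximated by averaging the softmax outputs of the baseline model and the model trained with this sampler.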
Abstract:Monitoring wildlife through camera traps produces a massive number of images, a significant portion of which contain no animals and are later discarded. Embedding deep learning models that identify animals and filter these images directly on the devices brings advantages such as savings in data storage and transmission, resources that are usually constrained in this type of equipment. In this work, we present a comparative study of animal recognition models to analyze the trade-off between precision and inference latency on edge devices. To accomplish this objective, we investigate classifiers and object detectors at various input resolutions and optimize them through quantization and by reducing the number of model filters. The confidence threshold of each model was adjusted to obtain 96% recall for the nonempty class, since instances of the empty class are expected to be discarded. The experiments show that, when trained on the same set of images, detectors achieve superior performance, eliminating at least 10% more empty images than classifiers with comparable latencies. However, given the high cost of generating labels for the detection problem, when a massive number of images labeled for classification is available (about one million instances, ten times more than those available for detection), classifiers can reach results comparable to detectors at half the latency.
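To make the threshold-tuning step concrete, the following is a minimal sketch, assuming arrays of per-image nonempty confidence scores and binary ground-truth labels from a validation set; threshold_for_recall is a hypothetical helper, not the paper's code.

```python
import numpy as np

def threshold_for_recall(scores, is_nonempty, target_recall=0.96):
    """Highest confidence threshold that keeps recall on the nonempty
    class at or above target_recall."""
    scores = np.asarray(scores, dtype=float)
    positives = scores[np.asarray(is_nonempty, dtype=bool)]
    # If every image scoring >= t is kept, recall is the fraction of
    # nonempty images scoring >= t; the (1 - target_recall) quantile of
    # the positive scores is the largest t retaining that fraction.
    return np.quantile(positives, 1.0 - target_recall)
```

Images scoring below the resulting threshold would be treated as empty and discarded on the device.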
Abstract:Automatic analysis of bioacoustic signals is a fundamental tool to evaluate the vitality of our planet. Frogs and bees, for instance, may act as biological sensors providing information about environmental changes. This task is fundamental for ecological monitoring but still poses many challenges, such as processing signals of nonuniform length, target signals degraded by environmental noise, and the scarcity of labeled samples for training machine learning models. To tackle these challenges, we present a bioacoustic signal classifier equipped with a discriminative mechanism that efficiently extracts features useful for analysis and classification. The proposed classifier does not require a large amount of training data and handles nonuniform signal lengths natively. Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces generated by applying Singular Spectrum Analysis (SSA). A subspace is then designed to expose discriminative features. The proposed model offers end-to-end capabilities, which is desirable in modern machine learning systems. This formulation provides a segmentation-free and noise-tolerant approach to representing and classifying bioacoustic signals, as well as a highly compact signal descriptor inherited from SSA. The validity of the proposed method is verified on three challenging bioacoustic datasets containing anuran, bee, and mosquito species. Experimental results show that the proposed method is competitive in accuracy with methods commonly employed for bioacoustic signal classification.
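For clarity on the SSA-based representation, the sketch below builds the trajectory (Hankel) matrix of a signal and keeps its leading left-singular vectors as a subspace descriptor, with a similarity based on canonical angles; the window and rank values are illustrative assumptions, not the paper's settings, and the discriminative subspace design step is omitted.

```python
import numpy as np

def ssa_subspace(signal, window=64, rank=8):
    """Represent a 1-D signal of any length as the rank leading
    left-singular vectors of its SSA trajectory (Hankel) matrix."""
    x = np.asarray(signal, dtype=float)
    k = len(x) - window + 1
    # Trajectory matrix: column j is the window starting at sample j.
    traj = np.stack([x[j:j + window] for j in range(k)], axis=1)
    u, _, _ = np.linalg.svd(traj, full_matrices=False)
    return u[:, :rank]  # fixed-size, compact descriptor

def subspace_similarity(a, b):
    """Mean squared cosine of the canonical angles between two
    orthonormal bases a and b."""
    return np.mean(np.linalg.svd(a.T @ b, compute_uv=False) ** 2)
```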
Abstract:The increasing use of multiple sensors requires more efficient methods to represent and classify multi-dimensional data, since these applications produce large amounts of data that demand modern processing techniques. With these observations in mind, we present in this paper a new method for multi-dimensional data classification that relies on two premises: 1) multi-dimensional data are usually represented by tensors, owing to the benefits of multilinear algebra and established tensor factorization methods; and 2) this kind of data can be described by a subspace lying within a vector space. Subspace representation has been consistently employed for pattern-set recognition, and its tensor counterpart is also available in the literature. However, traditional methods do not exploit the discriminative information of the tensors, which degrades classification accuracy. In this scenario, the generalized difference subspace (GDS) may provide an enhanced subspace representation by reducing data redundancy and revealing discriminative structures. Since GDS cannot directly handle tensor data, we propose a new projection called n-mode GDS, which handles tensor data efficiently. In addition, the n-mode Fisher score is introduced as a class separability index, and an improved metric based on the geodesic distance is provided to measure the similarity between tensor data. To confirm the advantages of the proposed method, we address the problem of representing and classifying tensor data for gesture and action recognition. The experimental results show that the proposed approach outperforms methods commonly used in the literature without adopting pre-trained models or transfer learning.
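As a rough illustration of the construction that the n-mode projection generalizes, the sketch below computes a vector-space GDS from orthonormal class-subspace bases, plus the mode-n unfolding on which a tensor extension would operate; this is a simplification under our own assumptions, not the authors' exact n-mode formulation.

```python
import numpy as np

def gds_basis(class_bases, n_drop):
    """Generalized difference subspace: eigendecompose the sum of the
    class projection matrices and drop the n_drop leading eigenvectors,
    which capture the component common to all classes."""
    g = sum(b @ b.T for b in class_bases)     # sum of projections
    eigvals, eigvecs = np.linalg.eigh(g)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # descending
    return eigvecs[:, order[n_drop:]]         # discriminative directions

def mode_n_unfold(tensor, mode):
    """Mode-n unfolding of a tensor into a matrix, so that a GDS can be
    built per mode from the resulting row spaces."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
```

Projecting data onto the returned basis removes the shared component of the class subspaces, which is one way to read the redundancy reduction described in the abstract.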