Abstract:Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. Our confidence intervals depend in a more detailed way on the tree parameters. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We furnish a theoretical guarantee bounding the probability that the classification is non-optimal when the decision tree is learned via our selective sampling strategy. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by other techniques, and that our active learning module reduces labeling cost. In addition, comparing our labeling strategy with recent methods, we show that our approach is more robust and consistent than the other techniques applied to incremental decision trees.
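As a rough illustration of the split decision discussed in the abstract above, the following minimal sketch shows the classic Hoeffding-style test: a leaf is split once the gap between the two best estimated gains exceeds a confidence radius. This is the textbook test, not the refined confidence intervals the abstract refers to, which are not specified here. The function names, the restriction to binary labels, the default `delta`, and the scaling of the three impurity functions (binary entropy in bits, Gini as 2p(1-p), and the Kearns-Mansour index as it is commonly written, 2·sqrt(p(1-p))) are all assumptions made for illustration.

```python
import math


def entropy_impurity(p):
    """Binary entropy (in bits) of the class-1 probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def gini_impurity(p):
    """Binary Gini index, 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    return 2 * p * (1 - p)


def kearns_mansour_impurity(p):
    """Kearns-Mansour index as commonly written: 2*sqrt(p(1-p))."""
    return 2 * math.sqrt(p * (1 - p))


def split_gain(left, right, impurity):
    """Impurity reduction of a binary split.

    `left` and `right` are (count_class0, count_class1) pairs for the
    two children (hypothetical representation chosen for this sketch).
    """
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    parent = impurity((left[1] + right[1]) / n)
    children = 0.0
    for side, n_side in ((left, n_left), (right, n_right)):
        if n_side > 0:
            children += (n_side / n) * impurity(side[1] / n_side)
    return parent - children


def hoeffding_bound(value_range, delta, n):
    """Hoeffding radius: sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gains, n, delta=1e-6, value_range=1.0):
    """Split when the best gain beats the runner-up by more than the radius."""
    ranked = sorted(gains, reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0] - ranked[1] > hoeffding_bound(value_range, delta, n)
```

For example, with 1000 examples at a leaf and `delta=1e-6`, the radius is roughly 0.083, so candidate gains of 0.30 and 0.10 would trigger a split while gains of 0.30 and 0.28 would not. The gain is not a simple average of independent observations, so applying Hoeffding's inequality to it directly is only a heuristic in this sketch.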
Abstract:Recognising human activities from streaming videos poses unique challenges to learning algorithms: predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. Furthermore, as parameter tuning is problematic in a streaming setting, suitable approaches should be parameterless, and make no assumptions on what class labels may occur in the stream. We present here an approach to the recognition of human actions from streaming data which meets all these requirements by: (1) incrementally learning a model which adaptively covers the feature space with simple local classifiers; (2) employing an active learning strategy to reduce annotation requests; (3) achieving promising accuracy within a fixed model size. Extensive experiments on standard benchmarks show that our approach is competitive with state-of-the-art non-incremental methods, and outperforms the existing active incremental baselines.
Abstract:As we enter the big data age and an avalanche of images has become readily available, recognition systems face the need to move from closed, lab settings, where the number of classes and the training data are fixed, to dynamic scenarios where the number of categories to be recognized grows continuously over time and new data keeps arriving that provides useful information to update the system. Recent attempts, like the open world recognition framework, tried to inject dynamics into the system by detecting new unknown classes and adding them incrementally, while at the same time continuously updating the models for the known classes. In this paper we argue that to properly capture the intrinsic dynamics of open world recognition, it is necessary to add to these aspects (a) the incremental learning of the underlying metric, (b) the incremental estimate of confidence thresholds for the unknown classes, and (c) the use of local learning to precisely describe the space of classes. We extend three existing metric learning algorithms towards these goals by using online metric learning. Experimentally we validate our approach on two large-scale datasets in different learning scenarios. For all these scenarios our proposed methods outperform their non-online counterparts. We conclude that local and online learning are important to capture the full dynamics of open world recognition.
Abstract:Stream mining poses unique challenges to machine learning: predictive models are required to be scalable and incrementally trainable, to remain bounded in size (even when the data stream is arbitrarily long), and to be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless, since traditional tuning methods are problematic in streaming settings, and must avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all of the above-mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach with increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance than all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream.