Abstract: Evaluating keyword spotting (KWS) systems, which detect keywords in speech, is challenging under realistic privacy constraints. A KWS system is designed to collect data only when the keyword is present, which limits the availability of hard samples that may contain false negatives and prevents direct estimation of model recall from production data. Complementary data collected from other sources, on the other hand, may not be fully representative of the real application. In this work, we propose an evaluation technique that we call AB/BA analysis. Our framework evaluates a candidate KWS model B against a baseline model A, using cross-dataset offline decoding for relative recall estimation, without requiring negative examples. Moreover, we propose a formulation with assumptions that allow the relative false positive rate between models to be estimated with low variance even when the number of false positives is small. Finally, we propose to leverage machine-generated soft labels, in a technique we call Semi-Supervised AB/BA analysis, which reduces analysis time and cost and improves privacy. Experiments with both simulated and real data show that AB/BA analysis successfully measures recall improvement together with the trade-off in relative false positive rate.
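The cross-dataset idea can be illustrated with a minimal sketch. The abstract does not give the estimator, so the following is an assumption about how such an estimate could be formed, not the authors' formulation: suppose dataset A holds true keyword utterances accepted by model A, dataset B holds those accepted by model B, and each model is decoded offline on the other's dataset. Since P(A and B) = P(B | A) P(A) = P(A | B) P(B) over true keyword utterances, the recall ratio P(B) / P(A) equals P(B | A) / P(A | B), and no negative examples are needed. The function name `relative_recall` and its arguments are hypothetical.

```python
def relative_recall(b_accepts_on_a, a_accepts_on_b):
    """Estimate recall_B / recall_A from cross-dataset offline decoding.

    b_accepts_on_a: list of booleans, model B's decision on each true
        keyword utterance that model A accepted (hypothetical dataset A).
    a_accepts_on_b: list of booleans, model A's decision on each true
        keyword utterance that model B accepted (hypothetical dataset B).
    """
    if not b_accepts_on_a or not a_accepts_on_b:
        raise ValueError("both datasets must be non-empty")

    # Fraction of A's positives that B also accepts: estimate of P(B | A).
    p_b_given_a = sum(b_accepts_on_a) / len(b_accepts_on_a)
    # Fraction of B's positives that A also accepts: estimate of P(A | B).
    p_a_given_b = sum(a_accepts_on_b) / len(a_accepts_on_b)

    if p_a_given_b == 0:
        raise ValueError("model A accepts nothing from dataset B; ratio undefined")

    # P(B) / P(A) = P(B | A) / P(A | B) over true keyword utterances.
    return p_b_given_a / p_a_given_b


# Toy usage: B accepts 3 of 4 of A's positives, A accepts 2 of 4 of B's.
ratio = relative_recall([True, True, True, False], [True, True, False, False])
print(ratio)  # 1.5
```

A ratio above 1 would suggest that the candidate model B recalls more keyword utterances than the baseline A, under the stated assumption that both datasets contain only true positives.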
Abstract: Label propagation on the tensor product of multiple graphs can infer multi-relations among the entities across the graphs by learning labels in a tensor. However, the tensor formulation is empirically scalable only up to three graphs because of the exponential complexity of computing tensors. In this paper, we propose an optimization formulation and a scalable Lowrank Tensor-based Label Propagation algorithm (LowrankTLP). The optimization formulation minimizes the rank-k approximation error for computing the closed-form solution of label propagation on a tensor product graph, using the efficient tensor computations of LowrankTLP. LowrankTLP takes as input either a sparse tensor of known multi-relations or pairwise relations between each pair of graphs, and infers unknown multi-relations by semi-supervised learning on the tensor product graph. We also accelerate LowrankTLP with parallel tensor computation, which enabled label propagation on a tensor product of 100 graphs of size 1000 within 150 seconds in simulation. LowrankTLP was also successfully applied to multi-relational learning for predicting author-paper-venue relations in publication records, aligning several protein-protein interaction networks across species, and aligning segmented regions across up to 7 CT scan images. The experiments show that LowrankTLP approximates the original label propagation well while remaining highly scalable. Source code: https://github.com/kuanglab/LowrankTLP
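To make the closed-form solution concrete, here is a minimal sketch for the two-graph case. It is not the LowrankTLP algorithm: it assumes the standard label propagation closed form F = (1 - alpha) (I - alpha S)^{-1} Y with S the Kronecker product of the two symmetrically normalized adjacency matrices, and it uses full eigendecompositions rather than a rank-k selection. The function names and the `alpha` parameter are illustrative. The point it shows is that the solution can be computed from per-graph eigenpairs without ever forming the product graph; a low-rank variant would keep only a subset of those eigenpairs.

```python
import numpy as np


def normalize(W):
    """Symmetric normalization D^{-1/2} W D^{-1/2} of an adjacency matrix."""
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every node needs positive degree")
    inv_sqrt = 1.0 / np.sqrt(d)
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]


def tpg_label_propagation(W1, W2, Y, alpha=0.5):
    """Closed-form label propagation on the tensor product of two graphs.

    W1 (n1 x n1) and W2 (n2 x n2) are symmetric adjacency matrices and
    Y (n1 x n2) holds the known pairwise labels. Returns F (n1 x n2).
    """
    S1, S2 = normalize(W1), normalize(W2)
    lam1, Q1 = np.linalg.eigh(S1)  # S1 = Q1 diag(lam1) Q1^T
    lam2, Q2 = np.linalg.eigh(S2)  # S2 = Q2 diag(lam2) Q2^T

    # Eigenvalues of the Kronecker product are all products lam1[i] * lam2[j];
    # they lie in [-1, 1], so the denominator stays positive for alpha < 1.
    G = (1.0 - alpha) / (1.0 - alpha * np.outer(lam1, lam2))

    # Rotate Y into the joint eigenbasis, rescale, and rotate back.
    return Q1 @ ((Q1.T @ Y @ Q2) * G) @ Q2.T


# Toy usage on two small random weighted graphs.
rng = np.random.default_rng(0)
A1, A2 = rng.random((4, 4)), rng.random((3, 3))
W1, W2 = (A1 + A1.T) / 2, (A2 + A2.T) / 2
np.fill_diagonal(W1, 0.0)
np.fill_diagonal(W2, 0.0)
Y = rng.random((4, 3))
F = tpg_label_propagation(W1, W2, Y, alpha=0.5)
```

For more than two graphs the same pattern would extend mode by mode, and the number of eigenvalue products grows exponentially with the number of graphs, which is where a rank-k selection becomes necessary.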
Abstract: The success of semi-supervised learning relies crucially on scaling to the huge amounts of unlabeled data needed to capture the underlying manifold structure for better classification. Because computing the pairwise similarity between training data points is prohibitively expensive for most kinds of input data, there is currently no general, ready-to-use semi-supervised learning method or tool for learning with tens of millions or more data points. In this paper, we adopt the ideas of two low-rank label propagation algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel Nyström Approximation, and implement parallelized versions of the two algorithms, accelerated with Nesterov's accelerated projected gradient descent, for Big-data Label Propagation (BigLP). The parallel algorithms are tested on five real datasets ranging from 7000 to 10,000,000 samples in size and on a simulated dataset of 100,000,000 samples. In the experiments, the implementation scales up to datasets with 100,000,000 samples and hundreds of features, and the algorithms also significantly improve prediction accuracy when only a very small percentage of the data is labeled. The results demonstrate that the BigLP implementation is highly scalable to big data and effective in utilizing unlabeled data for semi-supervised learning.
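As a rough illustration of why a Nyström-style low-rank approximation avoids the full pairwise similarity computation, the sketch below approximates an n x n kernel matrix K by C W^+ C^T, where C holds similarities to m landmark points and W is the m x m kernel among the landmarks. This is the generic Nyström construction, not the BigLP implementation; the RBF kernel, the `gamma` parameter, and all function names are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(X, Z, gamma=1.0):
    """RBF similarities exp(-gamma * ||x - z||^2) between rows of X and Z."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_factors(X, landmark_idx, gamma=1.0):
    """Return C (n x m) and pinv(W) (m x m) so that K is approximated by C W^+ C^T."""
    L = X[landmark_idx]
    C = rbf_kernel(X, L, gamma)  # similarities of all points to the landmarks
    W = rbf_kernel(L, L, gamma)  # similarities among the landmarks
    return C, np.linalg.pinv(W)


def nystrom_matvec(C, W_pinv, v):
    """Apply the approximate kernel to v without forming an n x n matrix.

    Costs O(n * m) operations instead of O(n^2).
    """
    return C @ (W_pinv @ (C.T @ v))


# Toy usage: 200 points, 20 randomly chosen landmarks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
landmarks = rng.choice(200, size=20, replace=False)
C, W_pinv = nystrom_factors(X, landmarks, gamma=2.0)
v = rng.normal(size=200)
Kv_approx = nystrom_matvec(C, W_pinv, v)
```

Iterative solvers such as projected gradient descent need only matrix-vector products of this kind, so the memory footprint stays at n x m rather than n x n.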