Abstract: Language identification is the task of automatically determining which language is spoken in a speech segment. It has a profound impact on the multilingual interoperability of intelligent speech systems. Although language identification attains high accuracy on medium or long utterances (>3s), performance on short utterances (<=1s) is still far from satisfactory. We propose an effective BERT-based language identification system (BERT-LID) to improve language identification performance, especially on short-duration speech segments. To adapt BERT to the LID pipeline, we insert a conjunction network before BERT to accommodate the frame-level phonetic posteriorgrams (PPGs) derived from the front-end phone recognizer, and then fine-tune the conjunction network and the pre-trained BERT model together. We evaluate several variations within this pipeline framework, including combining BERT with CNN, LSTM, DPCNN, and RCNN. The experimental results demonstrate that the best-performing model is RCNN-BERT. Compared with prior work, our RCNN-BERT model improves accuracy by about 5% on long-segment identification and 18% on short-segment identification. The superior performance of our model, especially on the short-segment task, demonstrates the applicability of the proposed BERT-based approach to language identification.
Abstract: One of the fundamental requirements for visual surveillance using smart camera networks is the correct association of each person's observations generated on different cameras. Recently, distributed data association, which involves only local information processing on each camera node and mutual information exchange between neighboring cameras, has attracted considerable research interest due to its advantages in large-scale applications. In this paper, we formulate the problem of data association in smart camera networks as an Integer Programming (IP) problem by introducing a set of linking variables, and propose two distributed algorithms, namely L-DD and Q-DD, to solve the IP problem using the dual decomposition technique. In our algorithms, the original IP problem is decomposed into several sub-problems, which can be solved locally and efficiently on each smart camera; the sub-problems then reach consensus on their solutions in a rigorous way by adjusting their parameters based on projected sub-gradient optimization. The proposed methods are simple and flexible, in that (i) any feature extraction and matching technique can be incorporated into our framework to measure the similarity between two observations, which is used to define the cost of each link, and (ii) the original problem can be decomposed in any way as long as each resulting sub-problem can be solved independently on an individual camera. We show the competitiveness of our methods in both accuracy and speed through theoretical analysis and experimental comparison with state-of-the-art algorithms on two real data sets collected by camera networks in our campus garden and office building.
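The general idea of dual decomposition with sub-gradient updates can be illustrated on a toy consensus problem. The sketch below is not the L-DD or Q-DD algorithm from the abstract: it assumes two hypothetical camera nodes that each hold a local copy of the same binary linking variables with their own link costs (`c1`, `c2`), and that must agree on a common assignment. All function names, cost values, and the `1/k` step-size rule are assumptions made for illustration. Because the multipliers here belong to equality constraints, they are unconstrained and the projection step reduces to the identity.

```python
def solve_local(costs):
    """Local sub-problem: choose x_i in {0, 1} minimizing sum(costs[i] * x_i)."""
    return [1 if c < 0 else 0 for c in costs]


def dual_decomposition_consensus(c1, c2, max_iters=100):
    """Toy dual decomposition for: min c1.x1 + c2.x2  s.t.  x1 == x2, binary.

    Returns (agreed assignment or None, multipliers, iterations used).
    """
    lam = [0.0] * len(c1)  # one Lagrange multiplier per shared link variable
    for k in range(1, max_iters + 1):
        # Each node solves its own sub-problem with multiplier-adjusted costs.
        x1 = solve_local([c + l for c, l in zip(c1, lam)])
        x2 = solve_local([c - l for c, l in zip(c2, lam)])
        if x1 == x2:
            return x1, lam, k  # the two local copies have reached consensus
        # Sub-gradient of the dual is the disagreement (x1 - x2);
        # a diminishing step size (assumed 1/k) is used for the ascent step.
        step = 1.0 / k
        lam = [l + step * (a - b) for l, a, b in zip(lam, x1, x2)]
    return None, lam, max_iters


# Hypothetical link costs seen by the two nodes.
x, lam, iters = dual_decomposition_consensus([-3.0, 2.0, 1.0], [1.0, -0.5, -4.0])
print(x, lam, iters)
```

In this toy setting each sub-problem separates per variable, so a link is selected exactly when the summed cost of the two nodes is negative; the multipliers only shift cost between the nodes until their local decisions coincide.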