Abstract:In this paper, we propose a highly practical fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input in video. The proposed method exploits the Gaussian mixture probability hypothesis density (GMPHD) filter for online approach which is extended with a hierarchical data association (HDA) and a simple affinity fusion (SAF) model. HDA consists of segment-to-track and track-to-track associations. To build the SAF model, an affinity is computed by using the GMPHD filter that is represented by the Gaussian mixture models with position and motion mean vectors, and another affinity for appearance is computed by using the responses from single object tracker such as the kernalized correlation filters. These two affinities are simply fused by using a score-level fusion method such as Min-max normalization. In addition, to reduce false positive segments, we adopt Mask IoU based merging. In experiments, those key modules, i.e., HDA, SAF, and Mask merging show incremental improvements. For instance, ID-switch decreases by half compared to baseline method. In conclusion, our tracker achieves state-of-the-art level MOTS performance.
Abstract:In this paper, we propose an efficient online multi-object tracking framework based on the GMPHD filter and occlusion group management scheme where the GMPHD filter utilizes hierarchical data association to reduce the false negatives caused by miss detection. The hierarchical data association consists of two steps: detection-to-track and track-to-track associations, which can recover the lost tracks and their switched IDs. In addition, the proposed framework is equipped with an object grouping management scheme which handles occlusion problems with two main parts. The first part is "track merging" which can merge the false positive tracks caused by false positive detections from occlusions, where the false positive tracks are usually occluded with a measure. The measure is the occlusion ratio between visual objects, sum-of-intersection-over-area (SIOA) we defined instead of the IOU metric. The second part is "occlusion group energy minimization (OGEM)" which prevents the occluded true positive tracks from false "track merging". We define each group of the occluded objects as an energy function and find an optimal hypothesis which makes the energy minimal. We evaluate the proposed tracker in benchmark datasets such as MOT15 and MOT17 which are built for multi-person tracking. An ablation study in training dataset shows that not only "track merging" and "OGEM" complement each other but also the proposed tracking method has more robust performance and less sensitive to parameters than baseline methods. Also, SIOA works better than IOU for various sizes of false positives. Experimental results show that the proposed tracker efficiently handles occlusion situations and achieves competitive performance compared to the state-of-the-art methods. Especially, our method shows the best multi-object tracking accuracy among the online and real-time executable methods.
Abstract:In online multiple pedestrian tracking it is of great importance to construct reliable cost matrix for assigning observations to tracks. Each element of cost matrix is constructed by using similarity measure. Many previous works have proposed their own similarity calculation methods consisting of geometric model (e.g. bounding box coordinates) and appearance model. In particular, appearance model contains information with higher dimension compared to geometric model. Thanks to the recent success of deep learning based methods, handling of high dimensional appearance information becomes possible. Among many deep networks, a siamese network with triplet loss is popularly adopted as an appearance feature extractor. Since the siamese network can extract features of each input independently, it is possible to adaptively model tracks (e.g. linear update). However, it is not suitable for multi-object setting that requires comparison with other inputs. In this paper we propose a novel track appearance modeling based on joint inference network to address this issue. The proposed method enables comparison of two inputs to be used for adaptive appearance modeling. It contributes to disambiguating target-observation matching and consolidating the identity consistency. Intensive experimental results support effectiveness of our method.
Abstract:In this study, a multiple hypothesis tracking (MHT) algorithm for multi-target multi-camera tracking (MCT) with disjoint views is proposed. Our method forms track-hypothesis trees, and each branch of them represents a multi-camera track of a target that may move within a camera as well as move across cameras. Furthermore, multi-target tracking within a camera is performed simultaneously with the tree formation by manipulating a status of each track hypothesis. Each status represents three different stages of a multi-camera track: tracking, searching, and end-of-track. The tracking status means targets are tracked by a single camera tracker. In the searching status, the disappeared targets are examined if they reappear in other cameras. The end-of-track status does the target exited the camera network due to its lengthy invisibility. These three status assists MHT to form the track-hypothesis trees for multi-camera tracking. Furthermore, they present a gating technique for eliminating of unlikely observation-to-track association. In the experiments, they evaluate the proposed method using two datasets, DukeMTMC and NLPR-MCT, which demonstrates that the proposed method outperforms the state-of-the-art method in terms of improvement of the accuracy. In addition, they show that the proposed method can operate in real-time and online.
Abstract:In this paper, we propose the methods to handle temporal errors during multi-object tracking. Temporal error occurs when objects are occluded or noisy detections appear near the object. In those situations, tracking may fail and various errors like drift or ID-switching occur. It is hard to overcome temporal errors only by using motion and shape information. So, we propose the historical appearance matching method and joint-input siamese network which was trained by 2-step process. It can prevent tracking failures although objects are temporally occluded or last matching information is unreliable. We also provide useful technique to remove noisy detections effectively according to scene condition. Tracking performance, especially identity consistency, is highly improved by attaching our methods.