Abstract: Evaluating tracking model performance is a complicated task, particularly for the non-contiguous, multi-object trackers that are crucial in defense applications. While various excellent tracking benchmarks are available, this work extends them to quantify the performance of long-term, non-contiguous, multi-object, and detection-model-assisted trackers. We propose a suite of MONCE (Multi-Object Non-Contiguous Entities) image tracking metrics that provide both objective benchmarks of tracking model performance and diagnostic insight for driving tracking model development. The suite comprises Expected Average Overlap, Short/Long Term Re-Identification, Tracking Recall, Tracking Precision, Longevity, Localization, and Absence Prediction.
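As an illustration of the kind of quantity an overlap-based metric builds on, the following is a minimal sketch of per-frame bounding-box overlap averaged over a non-contiguous track. It is not the MONCE definition of Expected Average Overlap, which the abstract does not spell out. The names `iou` and `average_overlap`, the `(x_min, y_min, x_max, y_max)` box layout, and the rule that a correctly predicted absence scores 1.0 are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_overlap(
    truth: Sequence[Optional[Box]], pred: Sequence[Optional[Box]]
) -> float:
    """Mean per-frame overlap for one entity.

    None marks a frame where the entity is absent (truth) or where the
    tracker reports no box (pred). Assumption for this sketch: agreeing on
    absence scores 1.0, disagreeing scores 0.0.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must cover the same frames")
    if not truth:
        return 0.0
    scores = []
    for t, p in zip(truth, pred):
        if t is None and p is None:
            scores.append(1.0)  # absence correctly predicted
        elif t is None or p is None:
            scores.append(0.0)  # missed entity or false report
        else:
            scores.append(iou(t, p))
    return sum(scores) / len(scores)
```

Treating absence explicitly is one simple way to avoid rewarding a tracker that keeps reporting a box after the entity has left the frame; a real benchmark may weight these cases differently.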
Abstract: Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications. With the growing use of unmanned aerial systems (UASs), MOT methods for aerial surveillance are in high demand. Applying MOT to UAS imagery presents specific challenges, such as a moving sensor, changing zoom levels, dynamic backgrounds, illumination changes, obscurations, and small objects. In this work, we present a robust object tracking architecture designed to accommodate the noise encountered in real-time situations. We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture is used to predict entity trajectories in latent space. DeepEKF utilizes a learned image embedding along with an attention mechanism trained to weight the importance of areas in an image when predicting future states. For visual scoring, we experiment with different similarity measures that calculate distance based on entity appearance, including a convolutional neural network (CNN) encoder pre-trained using Siamese networks. In initial evaluation experiments, we show that our method, which combines the scoring structure of the kinematic and visual models within a multiple hypothesis tracking (MHT) framework, has improved performance, especially in edge cases where entity motion is unpredictable or the data contain significant gaps between frames.
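To make the idea of combining kinematic and visual scores concrete, here is a minimal sketch of one way a track-to-detection association score could be formed. It is not the DeepEKF or MHT scoring used in the work, which the abstract does not specify. The Gaussian distance score, the cosine similarity on appearance embeddings, the weighted sum, and every name and parameter (`kinematic_score`, `combined_score`, `sigma`, `visual_weight`) are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def kinematic_score(
    predicted: Sequence[float], observed: Sequence[float], sigma: float = 5.0
) -> float:
    """Gaussian-shaped score in (0, 1]; equals 1.0 when the detection sits
    exactly at the predicted position. sigma is a hypothetical tolerance."""
    dist_sq = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def combined_score(
    predicted: Sequence[float],
    observed: Sequence[float],
    track_embedding: Sequence[float],
    detection_embedding: Sequence[float],
    visual_weight: float = 0.5,
    sigma: float = 5.0,
) -> float:
    """Weighted sum of kinematic and visual scores, both scaled to [0, 1].

    Assumption: a plain convex combination; the weight is a tunable guess.
    """
    kin = kinematic_score(predicted, observed, sigma)
    # Map cosine similarity from [-1, 1] to [0, 1] so both terms share a scale.
    vis = 0.5 * (cosine_similarity(track_embedding, detection_embedding) + 1.0)
    return (1.0 - visual_weight) * kin + visual_weight * vis
```

With a combination of this kind, a detection that lands far from the predicted position can still be associated with a track when its appearance matches strongly, which is the sort of behavior that helps when motion is unpredictable or frames are missing.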