Abstract: Existing deep multi-object tracking (MOT) approaches first learn a deep representation to describe target objects and then associate detection results by optimizing a linear assignment problem. Despite demonstrated successes, it remains challenging to discriminate target objects under mutual occlusion or to reduce identity switches in crowded scenes. In this paper, we propose learning deep conditional random field (CRF) networks that model the assignment costs as unary potentials and the long-term dependencies among detection results as pairwise potentials. Specifically, we use a bidirectional long short-term memory (LSTM) network to encode the long-term dependencies. We pose CRF inference as a recurrent neural network learning process using the standard gradient descent algorithm, where unary and pairwise potentials are jointly optimized in an end-to-end manner. Extensive experimental results on challenging MOT datasets, including MOT-2015 and MOT-2016, demonstrate that our approach achieves state-of-the-art performance compared with published works on both benchmarks.
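The association step that this abstract takes as its baseline, linear assignment over a track-to-detection cost matrix, can be illustrated with a minimal sketch. The function name `associate` and the cost values are hypothetical, and the exhaustive search is used only to keep the example dependency-free; practical trackers typically use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`). The sketch does not implement the proposed CRF model.

```python
from itertools import permutations


def associate(cost):
    """Return (assignment, total) with minimum total cost.

    cost[i][j] is the cost of matching track i to detection j, and
    assignment[i] is the detection index chosen for track i.
    Exhaustive search: only suitable for tiny illustrative matrices.
    """
    n_tracks = len(cost)
    n_dets = len(cost[0])
    assert n_tracks <= n_dets, "sketch assumes at least one detection per track"
    best_total = float("inf")
    best = None
    # Try every one-to-one mapping of tracks to distinct detections.
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return list(best), best_total


# Made-up costs: a greedy per-track choice would match track 0 to
# detection 0 (cost 1) and force track 1 onto detection 1 (cost 5).
cost = [
    [1, 2],
    [1, 5],
]
assignment, total = associate(cost)
print(assignment, total)
```

In this toy matrix the globally optimal matching differs from the greedy one, which is the reason association is posed as a joint assignment problem rather than solved one track at a time.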
Abstract: Designing a robust affinity model is the key issue in multiple target tracking (MTT). This paper proposes a novel affinity model that learns the feature representation and the distance metric jointly in a unified deep architecture. Specifically, we design a CNN to obtain an appearance cue tailored to person re-identification (Re-ID) and an LSTM network to obtain a motion cue that predicts target position. Both cues are combined with a triplet loss function, which performs end-to-end learning of the fused features in a desired embedding space. Experiments on the challenging MOT benchmark demonstrate that, even with a simple linear assignment strategy fed with the affinity scores of our method, very competitive results are achieved compared with the most recent state-of-the-art approaches.
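As a rough illustration of the triplet loss mentioned in this abstract, the sketch below uses the common hinge form max(0, d(a, p) - d(a, n) + margin) with Euclidean distance. The abstract does not state the distance function, the margin, or how the fused features are built, so those choices and the toy embeddings here are assumptions, not the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, margin is a free parameter).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows linearly.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 0.0]
near = [3.0, 4.0]   # distance 5 from the anchor
far = [6.0, 8.0]    # distance 10 from the anchor

# Same-identity sample is near, different-identity sample is far: no penalty.
loss_satisfied = triplet_loss(anchor, near, far, margin=1.0)
# Roles swapped: the same-identity sample is farther away, so the loss is positive.
loss_violated = triplet_loss(anchor, far, near, margin=1.0)
print(loss_satisfied, loss_violated)
```

Minimizing this quantity over many triplets pulls embeddings of the same target together and pushes different targets apart, so that distances in the embedding space can serve as affinity scores for the assignment step.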