Abstract:In public safety and social life, the task of Clothes-Changing Person Re-Identification (CC-ReID) has become increasingly significant. However, this task faces considerable challenges due to appearance changes caused by clothing alterations. Addressing this issue, this paper proposes an innovative method for disentangled feature extraction, effectively extracting discriminative features from pedestrian images that are invariant to clothing. This method leverages pedestrian parsing techniques to identify and retain features closely associated with individual identity while disregarding the variable nature of clothing attributes. Furthermore, this study introduces a gated channel attention mechanism, which, by adjusting the network's focus, aids the model in more effectively learning and emphasizing features critical for pedestrian identity recognition. Extensive experiments conducted on two standard CC-ReID datasets validate the effectiveness of the proposed approach, with performance surpassing current leading solutions. The Top-1 accuracy under clothing change scenarios on the PRCC and VC-Clothes datasets reached 64.8% and 83.7%, respectively.
Abstract:Traditional text-based person re-identification (ReID) techniques heavily rely on fully matched multi-modal data, which is an ideal scenario. However, due to inevitable data missing and corruption during the collection and processing of cross-modal data, the incomplete data issue is usually met in real-world applications. Therefore, we consider a more practical task termed the incomplete text-based ReID task, where person images and text descriptions are not completely matched and contain partially missing modality data. To this end, we propose a novel Prototype-guided Cross-modal Completion and Alignment (PCCA) framework to handle the aforementioned issues for incomplete text-based ReID. Specifically, we cannot directly retrieve person images based on a text query on missing modality data. Therefore, we propose the cross-modal nearest neighbor construction strategy for missing data by computing the cross-modal similarity between existing images and texts, which provides key guidance for the completion of missing modal features. Furthermore, to efficiently complete the missing modal features, we construct the relation graphs with the aforementioned cross-modal nearest neighbor sets of missing modal data and the corresponding prototypes, which can further enhance the generated missing modal features. Additionally, for tighter fine-grained alignment between images and texts, we raise a prototype-aware cross-modal alignment loss that can effectively reduce the modality heterogeneity gap for better fine-grained alignment in common space. Extensive experimental results on several benchmarks with different missing ratios amply demonstrate that our method can consistently outperform state-of-the-art text-image ReID approaches.