Abstract:Due to the successful development of deep image generation technology, forgery detection plays an increasingly important role in social and economic security. However, racial bias has not been explored thoroughly in the deep forgery detection field. In this paper, we first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, on which we demonstrate the racial bias of public state-of-the-art (SOTA) methods. Different from existing forgery detection datasets, the self-constructed FairFD dataset contains a balanced racial ratio and diverse forged images covering the largest number of subjects to date. Additionally, we identify the problems with naive fairness metrics when benchmarking forgery detection models. To comprehensively evaluate fairness, we design novel metrics, including the Approach Averaged Metric and the Utility Regularized Metric, which avoid deceptive results. Extensive experiments conducted with nine representative forgery detection models demonstrate the value of the proposed dataset and the soundness of the designed fairness metrics. We also conduct more in-depth analyses to offer insights that can inspire researchers in the community.
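A minimal sketch of the group-aware evaluation idea behind this abstract. The exact definitions of the Approach Averaged Metric and Utility Regularized Metric are not given here, so the per-group averaging and the utility-vs-gap trade-off below (including the `alpha` weight) are illustrative assumptions, not the paper's formulas; they only show how a naive overall metric can hide subgroup bias.

```python
import numpy as np

def accuracy(labels, preds):
    return float(np.mean(labels == preds))

def group_averaged_accuracy(labels, preds, groups):
    """Average accuracy over subgroups, so a majority group cannot
    mask poor performance on a minority group."""
    accs = [accuracy(labels[groups == g], preds[groups == g])
            for g in np.unique(groups)]
    return float(np.mean(accs)), accs

def utility_regularized_score(overall_acc, group_accs, alpha=0.5):
    """Hypothetical combination: penalize overall utility by the
    worst-case gap between subgroups (alpha is an assumed weight)."""
    gap = max(group_accs) - min(group_accs)
    return overall_acc - alpha * gap

# Toy detector that looks fine overall but fails on the minority group.
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
preds  = np.array([0, 1, 0, 1, 0, 1, 1, 0])
groups = np.array([0, 0, 0, 0, 0, 0, 1, 1])
overall = accuracy(labels, preds)                        # 0.75: deceptive
avg, per_group = group_averaged_accuracy(labels, preds, groups)
print(overall, avg, utility_regularized_score(overall, per_group))
```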
Abstract:Deep generative technology can produce high-quality fake videos that are indistinguishable from real ones, posing a serious social threat. Traditional forgery detection methods train directly on centralized data, overlooking scenarios in which video data cannot be shared publicly and data privacy must be preserved. Naturally, the federated learning strategy can be applied for privacy protection, aggregating the model parameters of clients rather than their original data. However, simple federated learning cannot achieve satisfactory performance because of poor generalization on realistic hybrid-domain forgery datasets. To solve this problem, this paper proposes a novel federated face forgery detection learning framework with personalized representation. The designed Personalized Forgery Representation Learning aims to learn a personalized representation for each client to improve the detection performance of individual client models. In addition, a personalized federated learning training strategy is utilized to update the parameters of the distributed detection model: collaborative training is conducted on multiple distributed client devices, and the shared representations of these client models are uploaded to the server side for aggregation. Experiments on several public face forgery detection datasets demonstrate the superior performance of the proposed algorithm compared with state-of-the-art methods. The code is available at https://github.com/GANG370/PFR-Forgery.
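A minimal sketch of the personalized federated training loop this abstract describes: clients keep a personalized head locally and upload only the shared representation (backbone) parameters for server-side averaging. The model structure and FedAvg-style aggregation below are standard-practice assumptions; the paper's Personalized Forgery Representation Learning is richer than this.

```python
import copy
import torch
import torch.nn as nn

class ClientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # shared
        self.head = nn.Linear(32, 2)  # personalized, never uploaded

    def forward(self, x):
        return self.head(self.backbone(x))

def aggregate_shared(client_models):
    """Average only the backbone parameters across clients (FedAvg-style);
    raw data and personalized heads stay on the client devices."""
    avg = copy.deepcopy(client_models[0].backbone.state_dict())
    for key in avg:
        avg[key] = torch.stack(
            [m.backbone.state_dict()[key].float() for m in client_models]
        ).mean(dim=0)
    return avg

clients = [ClientModel() for _ in range(3)]
# ... each client trains locally on its private forgery data ...
shared = aggregate_shared(clients)
for m in clients:                       # broadcast aggregated representation
    m.backbone.load_state_dict(shared)  # heads remain personalized
```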
Abstract:Cloth-changing person re-identification (CC-ReID) aims to match persons who change clothes over long periods. The key challenge in CC-ReID is to extract clothing-independent features, such as face, hairstyle, body shape, and gait. Current research mainly focuses on modeling body shape using multi-modal biological features (such as silhouettes and sketches), but it does not fully leverage the personal description information hidden in the original RGB image. Considering that certain attribute descriptions remain unchanged after a change of clothes, we propose a Masked Attribute Description Embedding (MADE) method that unifies personal visual appearance and attribute descriptions for CC-ReID. Specifically, variable clothing-sensitive information, such as color and type, is difficult to model effectively. To address this, we mask the clothing and color information in the personal attribute description extracted by an attribute detection model. The masked attribute description is then concatenated and embedded into Transformer blocks at various levels, fusing it with the low-level to high-level features of the image. This approach compels the model to discard clothing information. Experiments are conducted on several CC-ReID benchmarks, including PRCC, LTCC, Celeb-reID-light, and LaST. The results demonstrate that MADE effectively utilizes attribute descriptions, enhances cloth-changing person re-identification performance, and compares favorably with state-of-the-art methods. The code is available at https://github.com/moon-wh/MADE.
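A minimal sketch of the attribute-masking step described above: tokens from an attribute detector that describe clothing or color are replaced with a mask token before being embedded. The term list, token format, and mask symbol below are hypothetical; the paper obtains descriptions from a detection model rather than a hand-written vocabulary.

```python
# Assumed, illustrative vocabulary of clothing- and color-related terms.
CLOTHING_COLOR_TERMS = {
    "shirt", "t-shirt", "coat", "dress", "jeans", "skirt",
    "red", "blue", "black", "white", "green",
}

def mask_attribute_description(tokens, mask_token="[MASK]"):
    """Replace clothing- and color-related attribute tokens so the
    downstream Transformer cannot rely on clothing cues."""
    return [mask_token if t.lower() in CLOTHING_COLOR_TERMS else t
            for t in tokens]

attrs = ["female", "long", "hair", "red", "dress", "carrying", "backpack"]
print(mask_attribute_description(attrs))
# ['female', 'long', 'hair', '[MASK]', '[MASK]', 'carrying', 'backpack']
```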
Abstract:This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), which aims to use sketches from unseen categories as queries to match images of the same category. Due to the large cross-modality discrepancy, ZS-SBIR remains a challenging task that mimics realistic zero-shot scenarios. The key is to leverage transferable knowledge from a pre-trained model to improve generalizability. Existing methods often adopt a simple fine-tuning strategy or distill knowledge from a teacher model with fixed parameters, lacking efficient bidirectional knowledge alignment between the student and teacher models for better generalization. In this paper, we propose a novel Symmetrical Bidirectional Knowledge Alignment method for zero-shot sketch-based image retrieval (SBKA). The symmetrical bidirectional knowledge alignment learning framework is designed so that the teacher and student models effectively learn rich discriminative information from each other, achieving the goal of knowledge alignment. In addition, instead of the conventional one-to-one cross-modality matching at the testing stage, a one-to-many cluster cross-modality matching method is proposed to leverage the inherent relationships among intra-class images and reduce the adverse effects of the modality gap. Experiments on several representative ZS-SBIR datasets (the Sketchy Ext, TU-Berlin Ext, and QuickDraw Ext datasets) show that the proposed algorithm achieves superior performance compared with state-of-the-art methods.
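A minimal sketch of the one-to-many cluster matching idea at test time: the sketch query is scored against cluster centroids of gallery image embeddings, and each image inherits its cluster's similarity, so intra-class structure smooths over the modality gap. The cluster assignments are assumed to come from any off-the-shelf clustering (e.g., k-means); shapes and the cosine scoring are illustrative assumptions.

```python
import numpy as np

def cluster_match(query, gallery_feats, cluster_ids):
    """Score each gallery image by the cosine similarity between the
    query and its cluster centroid, then return ranked gallery indices."""
    query = query / np.linalg.norm(query)
    scores = np.empty(len(gallery_feats))
    for c in np.unique(cluster_ids):
        centroid = gallery_feats[cluster_ids == c].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        scores[cluster_ids == c] = query @ centroid
    return np.argsort(-scores)

rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 8))           # 10 gallery image embeddings
clusters = np.array([0]*4 + [1]*3 + [2]*3)   # assumed cluster assignments
print(cluster_match(rng.normal(size=8), gallery, clusters))
```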
Abstract:Deepfake detection refers to detecting artificially generated or edited faces in images or videos, and it plays an essential role in visual information security. Despite promising progress in recent years, Deepfake detection remains a challenging problem due to the complexity and variability of face forgery techniques. Existing Deepfake detection methods are often devoted to extracting features with sophisticated networks but ignore the influence of the perceptual quality of faces. Considering the complexity of the quality distribution of both real and fake faces, we propose a novel Deepfake detection framework named DeepFidelity that adaptively distinguishes real and fake faces of varying image quality by mining the perceptual forgery fidelity of face images. Specifically, we improve the model's ability to identify complex samples by mapping real and fake face data of different qualities to different scores, distinguishing them in a more fine-grained way. In addition, we propose a network structure called the Symmetric Spatial Attention Augmentation based vision Transformer (SSAAFormer), which exploits the symmetry of face images to encourage the network to model long-range spatial relationships at the shallow level and augment local features. Extensive experiments on multiple benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
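A minimal sketch of the fidelity-score idea this abstract describes: rather than a hard real/fake label, each face is regressed to a score that also reflects its perceptual quality, and a threshold separates the classes. The target construction below, the quality source, and the toy backbone are all illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

def fidelity_target(is_real: torch.Tensor, quality: torch.Tensor):
    """Assumed mapping: real faces score high, fake faces score low;
    within each class, higher quality pushes the score to its extreme."""
    return torch.where(is_real.bool(), 0.5 + 0.5 * quality, 0.5 - 0.5 * quality)

# Toy stand-in for a feature backbone with a scalar score head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1), nn.Sigmoid())
faces = torch.rand(4, 3, 32, 32)
is_real = torch.tensor([1, 1, 0, 0])
quality = torch.tensor([0.9, 0.3, 0.8, 0.2])  # e.g., from a quality assessor
loss = nn.functional.mse_loss(backbone(faces).squeeze(1),
                              fidelity_target(is_real, quality))
loss.backward()
```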
Abstract:Identity tracing is a technology that discovers the true identity of a target object by selecting and collecting its identity attributes, and it is one of the most important foundational issues in the field of social security prevention. However, traditional identity recognition technologies based on a single attribute struggle to achieve high recognition accuracy, while deep learning-based models often lack interpretability. Multi-attribute collaborative identification is a possible key way to overcome these recognition errors and low data quality problems. In this paper, we propose the Trustworthy Identity Tracing (TIT) task and a Multi-attribute Synergistic Identification based TIT framework. We first establish a novel identity model grounded theoretically in identity entropy. The individual conditional identity entropy and the core identification set are defined to reveal the intrinsic mechanism of multi-attribute collaborative identification. Based on the proposed identity model, we propose a trustworthy identity tracing framework (TITF) with multi-attribute synergistic identification to determine the identity of unknown objects, which can optimize the core identification set and provide an interpretable identity tracing process. In essence, identity tracing is revealed to be the process of driving the identity entropy to zero. To cope with the lack of test data, we construct a dataset of 1000 objects that simulates real-world scenarios, where 20 identity attributes are labeled to trace unknown object identities. Experimental results on this dataset show that the proposed TITF algorithm achieves satisfactory identification performance.
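A minimal sketch of the identity-entropy view described above: each observed attribute filters the candidate set, and the entropy of the remaining identity distribution shrinks, with identification succeeding when it reaches zero. The toy attribute table and the uniform-distribution assumption are hypothetical simplifications of the paper's identity model.

```python
import math

# Hypothetical candidate pool with labeled identity attributes.
CANDIDATES = {
    "alice": {"gender": "f", "height": "tall",  "gait": "fast"},
    "bob":   {"gender": "m", "height": "tall",  "gait": "fast"},
    "carol": {"gender": "f", "height": "short", "gait": "fast"},
    "dave":  {"gender": "m", "height": "short", "gait": "slow"},
}

def identity_entropy(candidates):
    """Entropy of a uniform distribution over the remaining candidates."""
    n = len(candidates)
    return math.log2(n) if n else 0.0

def observe(candidates, attribute, value):
    """Condition on one observed attribute, shrinking the candidate set."""
    return {k: v for k, v in candidates.items() if v[attribute] == value}

pool = dict(CANDIDATES)
print(identity_entropy(pool))               # 2.0 bits: four candidates
pool = observe(pool, "gender", "f")
print(identity_entropy(pool))               # 1.0 bit: two candidates left
pool = observe(pool, "height", "tall")
print(identity_entropy(pool), list(pool))   # 0.0 bits -> ['alice']
```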
Abstract:Unsupervised image Anomaly Detection (UAD) aims to learn robust and discriminative representations of normal samples. While training a separate model per class incurs expensive computation and limited generalizability, this paper focuses on building a unified framework for multiple classes. Under such a challenging setting, popular reconstruction-based networks that assume continuous latent representations suffer from the "identical shortcut" issue, where both normal and abnormal samples are well recovered and thus difficult to distinguish. To address this pivotal issue, we propose a hierarchical vector quantized prototype-oriented Transformer under a probabilistic framework. First, instead of learning continuous representations, we preserve typical normal patterns as discrete iconic prototypes and confirm the importance of vector quantization in preventing the model from falling into the shortcut. The vector quantized iconic prototypes are integrated into the Transformer for reconstruction, such that abnormal data points are flipped to normal data points. Second, we investigate a hierarchical framework to relieve the codebook collapse issue and replenish frail normal patterns. Third, a prototype-oriented optimal transport method is proposed to better regulate the prototypes and hierarchically evaluate the anomaly score. Evaluations on the MVTec-AD and VisA datasets show that our model surpasses state-of-the-art alternatives and possesses good interpretability. The code is available at https://github.com/RuiyingLu/HVQ-Trans.
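A minimal sketch of the vector-quantization step at the heart of this abstract: each continuous feature is snapped to its nearest discrete prototype from a codebook, so only typical normal patterns survive for reconstruction and abnormal features are "flipped" toward normal ones. The codebook size and straight-through gradient are standard VQ-VAE conventions assumed here, not details from the paper.

```python
import torch

def vector_quantize(z, codebook):
    """Replace each feature vector in z (N, D) by its nearest prototype
    from codebook (K, D); gradients pass straight through."""
    dists = torch.cdist(z, codebook)     # (N, K) pairwise distances
    idx = dists.argmin(dim=1)            # nearest prototype per feature
    z_q = codebook[idx]
    return z + (z_q - z).detach(), idx   # straight-through estimator

codebook = torch.randn(16, 8)            # 16 iconic prototypes
features = torch.randn(4, 8, requires_grad=True)
quantized, idx = vector_quantize(features, codebook)
print(idx)  # which prototype each feature was snapped to
```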
Abstract:Due to the successful development of deep image generation technology, visual data forgery detection plays an increasingly important role in social and economic security. Existing forgery detection methods suffer from unsatisfactory generalization when determining authenticity in unseen domains. In this paper, we propose a novel Attention Consistency refined Masked Frequency forgery representation model for generalizable face forgery detection (ACMF). Most forgery technologies introduce high-frequency artifacts, which make it easy to determine source authenticity but difficult to generalize to unseen artifact types. The masked frequency forgery representation module is designed to explore robust forgery cues by randomly discarding high-frequency information. In addition, we find that inconsistency of the forgery attention maps across the detection network can harm generalizability. Thus, forgery attention consistency is introduced to force the detector to focus on similar attention regions for better generalization. Experimental results on several public face forgery datasets (the FaceForensics++, DFD, Celeb-DF, and WDF datasets) demonstrate the superior performance of the proposed method compared with state-of-the-art methods.
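A minimal sketch of the masked frequency representation described above: the image is moved to the frequency domain and high-frequency components are randomly discarded, pushing the detector toward cues that survive across forgery types. The radius threshold, drop probability, and per-pixel masking scheme below are illustrative assumptions, not the paper's exact module.

```python
import torch

def mask_high_frequency(img, radius_ratio=0.25, drop_prob=0.5):
    """Randomly zero out spectrum entries outside a low-frequency radius."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
    high = dist > radius_ratio * min(h, w)    # high-frequency region
    keep = torch.rand(h, w) > drop_prob       # random survival mask
    mask = (~high) | keep                     # always keep low frequencies
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

img = torch.rand(1, 3, 64, 64)
print(mask_high_frequency(img).shape)  # torch.Size([1, 3, 64, 64])
```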
Abstract:Image super-resolution (SR) is a technique to recover the high-frequency information lost in low-resolution (LR) images. Spatial-domain information has been widely exploited for image SR, so a new trend is to involve frequency-domain information in SR tasks. Moreover, image SR is typically application-oriented, and various computer vision tasks call for arbitrary image magnification. Therefore, in this paper, we study image features in the frequency domain to design a novel scale-arbitrary image SR network. First, we statistically analyze LR-HR image pairs from several datasets under different scale factors and find that the high-frequency spectra of different images under different scale factors suffer from different degrees of degradation, whereas the valid low-frequency spectra tend to be retained within a certain distribution range. Based on this finding, we devise an adaptive scale-aware feature division mechanism using deep reinforcement learning, which can accurately and adaptively divide the frequency spectrum into a low-frequency part to be retained and a high-frequency part to be recovered. Finally, we design a scale-aware feature recovery module to capture and fuse multi-level features for reconstructing the high-frequency spectrum at arbitrary scale factors. Extensive experiments on public datasets show the superiority of our method over state-of-the-art methods.
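A minimal sketch of the spectrum division this abstract describes: the LR spectrum is split at a cutoff radius into a low-frequency part to be retained and a high-frequency part to be recovered. In the paper the division is chosen adaptively by a deep reinforcement learning agent; here the cutoff is a simple assumed function of the scale factor, standing in for that learned policy.

```python
import numpy as np

def split_spectrum(img, scale):
    """Divide a 2D image spectrum into retained low frequencies and
    high frequencies left for the recovery network."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    cutoff = min(h, w) / (2 * scale)          # assumed scale-aware cutoff
    low = np.where(dist <= cutoff, spec, 0)   # retained
    high = np.where(dist > cutoff, spec, 0)   # to be recovered
    return low, high

low, high = split_spectrum(np.random.rand(32, 32), scale=4)
print(np.abs(low).sum(), np.abs(high).sum())
```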
Abstract:Detecting abnormal crowd motion emerging from complex interactions of individuals is paramount to ensuring crowd safety. Crowd-level abnormal behaviors (CABs), e.g., counter flow and crowd turbulence, have proven to be crucial causes of many crowd disasters. In the past decade, video anomaly detection (VAD) techniques have achieved remarkable success in detecting individual-level abnormal behaviors (e.g., sudden running, fighting, and stealing), but research on VAD for CABs remains limited. Unlike individual-level anomalies, CABs usually do not exhibit salient differences from normal behaviors when observed locally, and the scale of CABs can vary from one scenario to another. In this paper, we present a systematic study of the important problem of VAD for CABs with a novel crowd motion learning framework, the multi-scale motion consistency network (MSMC-Net). MSMC-Net first captures spatial and temporal crowd motion consistency information in a graph representation. It then simultaneously trains multiple feature graphs constructed at different scales to capture rich crowd patterns. An attention network is used to adaptively fuse the multi-scale features for better CAB detection. For the empirical study, we consider three large-scale crowd event datasets: UMN, Hajj, and Love Parade. Experimental results show that MSMC-Net substantially improves state-of-the-art performance on all three datasets.
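A minimal sketch of the attention-based fusion of multi-scale features mentioned above: per-scale feature vectors are weighted by learned attention scores and summed. The feature dimensions and the single-linear scoring network are assumptions; in the paper the inputs come from feature graphs built over crowd motion at different scales.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one attention logit per scale

    def forward(self, feats):
        """feats: (batch, num_scales, dim) -> fused (batch, dim)."""
        weights = torch.softmax(self.score(feats), dim=1)  # (B, S, 1)
        return (weights * feats).sum(dim=1)

fusion = MultiScaleAttentionFusion(dim=64)
scale_feats = torch.randn(2, 3, 64)   # features from 3 graph scales
print(fusion(scale_feats).shape)      # torch.Size([2, 64])
```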