Abstract:Large numbers of synthesized videos from diffusion models pose threats to information security and authenticity, leading to an increasing demand for generated content detection. However, existing video-level detection algorithms primarily focus on detecting facial forgeries and often fail to identify diffusion-generated content with a diverse range of semantics. To advance the field of video forensics, we propose an innovative algorithm named Multi-Modal Detection(MM-Det) for detecting diffusion-generated videos. MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR) from LMM's multi-modal space, enhancing its ability to detect unseen forgery content. Besides, MM-Det leverages an In-and-Across Frame Attention (IAFA) mechanism for feature augmentation in the spatio-temporal domain. A dynamic fusion strategy helps refine forgery representations for the fusion. Moreover, we construct a comprehensive diffusion video dataset, called Diffusion Video Forensics (DVF), across a wide range of forgery videos. MM-Det achieves state-of-the-art performance in DVF, demonstrating the effectiveness of our algorithm. Both source code and DVF are available at https://github.com/SparkleXFantasy/MM-Det.
Abstract:Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++. Specifically, HiFi-Net++ contains four components: a multi-branch feature extractor, a language-guided forgery localization enhancer, as well as classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment pixel-level forgery regions and detect image-level forgery, respectively. Also, the language-guided forgery localization enhancer (LFLE), containing image and text encoders learned by contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and then generates the visual embedding and manipulation score maps, which are used to further improve HiFi-Net++ manipulation localization performance. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on $8$ by using different benchmarks for both tasks of IFDL and forgery attribute classification. Our source code and dataset are available.
Abstract:Rapid advancements in multimodal large language models have enabled the creation of hyper-realistic images from textual descriptions. However, these advancements also raise significant concerns about unauthorized use, which hinders their broader distribution. Traditional watermarking methods often require complex integration or degrade image quality. To address these challenges, we introduce a novel framework Towards Effective user Attribution for latent diffusion models via Watermark-Informed Blending (TEAWIB). TEAWIB incorporates a unique ready-to-use configuration approach that allows seamless integration of user-specific watermarks into generative models. This approach ensures that each user can directly apply a pre-configured set of parameters to the model without altering the original model parameters or compromising image quality. Additionally, noise and augmentation operations are embedded at the pixel level to further secure and stabilize watermarked images. Extensive experiments validate the effectiveness of TEAWIB, showcasing the state-of-the-art performance in perceptual quality and attribution accuracy.
Abstract:Model Parsing defines the research task of predicting hyperparameters of the generative model (GM), given a generated image as input. Since a diverse set of hyperparameters is jointly employed by the generative model, and dependencies often exist among them, it is crucial to learn these hyperparameter dependencies for the improved model parsing performance. To explore such important dependencies, we propose a novel model parsing method called Learnable Graph Pooling Network (LGPN). Specifically, we transform model parsing into a graph node classification task, using graph nodes and edges to represent hyperparameters and their dependencies, respectively. Furthermore, LGPN incorporates a learnable pooling-unpooling mechanism tailored to model parsing, which adaptively learns hyperparameter dependencies of GMs used to generate the input image. We also extend our proposed method to CNN-generated image detection and coordinate attacks detection. Empirically, we achieve state-of-the-art results in model parsing and its extended applications, showing the effectiveness of our method. Our source code are available.
Abstract:Modern multi-layer networks are commonly stored and analyzed in a local and distributed fashion because of the privacy, ownership, and communication costs. The literature on the model-based statistical methods for community detection based on these data is still limited. This paper proposes a new method for consensus community detection and estimation in a multi-layer stochastic block model using locally stored and computed network data with privacy protection. A novel algorithm named privacy-preserving Distributed Spectral Clustering (ppDSC) is developed. To preserve the edges' privacy, we adopt the randomized response (RR) mechanism to perturb the network edges, which satisfies the strong notion of differential privacy. The ppDSC algorithm is performed on the squared RR-perturbed adjacency matrices to prevent possible cancellation of communities among different layers. To remove the bias incurred by RR and the squared network matrices, we develop a two-step bias-adjustment procedure. Then we perform eigen-decomposition on the debiased matrices, aggregation of the local eigenvectors using an orthogonal Procrustes transformation, and k-means clustering. We provide theoretical analysis on the statistical errors of ppDSC in terms of eigen-vector estimation. In addition, the blessings and curses of network heterogeneity are well-explained by our bounds.
Abstract:Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. Our proposed IFDL framework contains three components: multi-branch feature extractor, localization and classification modules. Each branch of the feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on $7$ different benchmarks, for both tasks of IFDL and forgery attribute classification. Our source code and dataset can be found: \href{https://github.com/CHELSEA234/HiFi_IFDL}{github.com/CHELSEA234/HiFi-IFDL}.
Abstract:Identifying functional connectivity biomarkers of major depressive disorder (MDD) patients is essential to advance understanding of the disorder mechanisms and early intervention. However, due to the small sample size and the high dimension of available neuroimaging data, the performance of existing methods is often limited. Multi-site data could enhance the statistical power and sample size, while they are often subject to inter-site heterogeneity and data-sharing policies. In this paper, we propose a federated joint estimator, NOTEARS-PFL, for simultaneous learning of multiple Bayesian networks (BNs) with continuous optimization, to identify disease-induced alterations in MDD patients. We incorporate information shared between sites and site-specific information into the proposed federated learning framework to learn personalized BN structures by introducing the group fused lasso penalty. We develop the alternating direction method of multipliers, where in the local update step, the neuroimaging data is processed at each local site. Then the learned network structures are transmitted to the center for the global update. In particular, we derive a closed-form expression for the local update step and use the iterative proximal projection method to deal with the group fused lasso penalty in the global update step. We evaluate the performance of the proposed method on both synthetic and real-world multi-site rs-fMRI datasets. The results suggest that the proposed NOTEARS-PFL yields superior effectiveness and accuracy than the comparable methods.
Abstract:In this work, we study multi-domain learning for face anti-spoofing(MD-FAS), where a pre-trained FAS model needs to be updated to perform equally well on both source and target domains while only using target domain data for updating. We present a new model for MD-FAS, which addresses the forgetting issue when learning new domain data, while possessing a high level of adaptability. First, we devise a simple yet effective module, called spoof region estimator(SRE), to identify spoof traces in the spoof image. Such spoof traces reflect the source pre-trained model's responses that help upgraded models combat catastrophic forgetting during updating. Unlike prior works that estimate spoof traces which generate multiple outputs or a low-resolution binary mask, SRE produces one single, detailed pixel-wise estimate in an unsupervised manner. Secondly, we propose a novel framework, named FAS-wrapper, which transfers knowledge from the pre-trained models and seamlessly integrates with different FAS models. Lastly, to help the community further advance MD-FAS, we construct a new benchmark based on SIW, SIW-Mv2 and Oulu-NPU, and introduce four distinct protocols for evaluation, where source and target domains are different in terms of spoof type, age, ethnicity, and illumination. Our proposed method achieves superior performance on the MD-FAS benchmark than previous methods. Our code and newly curated SIW-Mv2 are publicly available.
Abstract:Singular value decomposition (SVD) is one of the most fundamental tools in machine learning and statistics.The modern machine learning community usually assumes that data come from and belong to small-scale device users. The low communication and computation power of such devices, and the possible privacy breaches of users' sensitive data make the computation of SVD challenging. Federated learning (FL) is a paradigm enabling a large number of devices to jointly learn a model in a communication-efficient way without data sharing. In the FL framework, we develop a class of algorithms called FedPower for the computation of partial SVD in the modern setting. Based on the well-known power method, the local devices alternate between multiple local power iterations and one global aggregation to improve communication efficiency. In the aggregation, we propose to weight each local eigenvector matrix with Orthogonal Procrustes Transformation (OPT). Considering the practical stragglers' effect, the aggregation can be fully participated or partially participated, where for the latter we propose two sampling and aggregation schemes. Further, to ensure strong privacy protection, we add Gaussian noise whenever the communication happens by adopting the notion of differential privacy (DP). We theoretically show the convergence bound for FedPower. The resulting bound is interpretable with each part corresponding to the effect of Gaussian noise, parallelization, and random sampling of devices, respectively. We also conduct experiments to demonstrate the merits of FedPower. In particular, the local iterations not only improve communication efficiency but also reduce the chance of privacy breaches.
Abstract:The ability to capture complex linguistic structures and long-term dependencies among words in the passage is essential for many natural language understanding tasks. In relation extraction, dependency trees that contain rich syntactic clues have been widely used to help capture long-term dependencies in text. Graph neural networks (GNNs), one of the means to encode dependency graphs, has been shown effective in several prior works. However, relatively little attention has been paid to the receptive fields of GNNs, which can be crucial in tasks with extremely long text that go beyond single sentences and require discourse analysis. In this work, we leverage the idea of graph pooling and propose the Mirror Graph Convolution Network (MrGCN), a GNN model with pooling-unpooling structures tailored to relation extraction. The pooling branch reduces the graph size and enables the GCN to obtain larger receptive fields within less layers; the unpooling branch restores the pooled graph to its original resolution such that token-level relation extraction can be performed. Experiments on two datasets demonstrate the effectiveness of our method, showing significant improvements over previous results.