Abstract:In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
Abstract:Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations. 1) neglecting the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) focusing on global feature extraction, but overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) contrastive learning primarily targets high-level features but ignoring low-level details which are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two-fold ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By well handling the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
Abstract:Deepfake attribution (DFA) aims to perform multiclassification on different facial manipulation techniques, thereby mitigating the detrimental effects of forgery content on the social order and personal reputations. However, previous methods focus only on method-specific clues, which easily lead to overfitting, while overlooking the crucial role of common forgery features. Additionally, they struggle to distinguish between uncertain novel classes in more practical open-world scenarios. To address these issues, in this paper we propose an innovative multi-DisentAnglement based conTrastive leArning framework, DATA, to enhance the generalization ability on novel classes for the open-world semi-supervised deepfake attribution (OSS-DFA) task. Specifically, since all generation techniques can be abstracted into a similar architecture, DATA defines the concept of 'Orthonormal Deepfake Basis' for the first time and utilizes it to disentangle method-specific features, thereby reducing the overfitting on forgery-irrelevant information. Furthermore, an augmented-memory mechanism is designed to assist in novel class discovery and contrastive learning, which aims to obtain clear class boundaries for the novel classes through instance-level disentanglements. Additionally, to enhance the standardization and discrimination of features, DATA uses bases contrastive loss and center contrastive loss as auxiliaries for the aforementioned modules. Extensive experimental evaluations show that DATA achieves state-of-the-art performance on the OSS-DFA benchmark, e.g., there are notable accuracy improvements in 2.55% / 5.7% under different settings, compared with the existing methods.
Abstract:Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach RealID to enhance generalization by learning a comprehensive concept of real faces while assessing the probabilities of belonging to the real and fake classes independently. RealID comprises two key modules: the Real Concept Capture Module (RealC2) and the Independent Dual-Decision Classifier (IDC). With the assistance of a MultiReal Memory, RealC2 maintains various prototypes for real faces, allowing the model to capture a comprehensive concept of real class. Meanwhile, IDC redefines the classification strategy by making independent decisions based on the concept of the real class and the presence of forgery artifacts. Through the combined effect of the above modules, the influence of forgery-irrelevant patterns is alleviated, and extensive experiments on five widely used datasets demonstrate that RealID significantly outperforms existing state-of-the-art methods, achieving a 1.74% improvement in average accuracy.
Abstract:Concept Factorization (CF) models have attracted widespread attention due to their excellent performance in data clustering. In recent years, many variant models based on CF have achieved great success in clustering by taking into account the internal geometric manifold structure of the dataset and using graph regularization techniques. However, their clustering performance depends greatly on the construction of the initial graph structure. In order to enable adaptive learning of the graph structure of the data, we propose a Concept Factorization Based on Self-Representation and Adaptive Graph Structure Learning (CFSRAG) Model. CFSRAG learns the affinity relationship between data through a self-representation method, and uses the learned affinity matrix to implement dynamic graph regularization constraints, thereby ensuring dynamic learning of the internal geometric structure of the data. Finally, we give the CFSRAG update rule and convergence analysis, and conduct comparative experiments on four real datasets. The results show that our model outperforms other state-of-the-art models.
Abstract:Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datsets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
Abstract:The advent of local continuous image function (LIIF) has garnered significant attention for arbitrary-scale super-resolution (SR) techniques. However, while the vulnerabilities of fixed-scale SR have been assessed, the robustness of continuous representation-based arbitrary-scale SR against adversarial attacks remains an area warranting further exploration. The elaborately designed adversarial attacks for fixed-scale SR are scale-dependent, which will cause time-consuming and memory-consuming problems when applied to arbitrary-scale SR. To address this concern, we propose a simple yet effective ``scale-invariant'' SR adversarial attack method with good transferability, termed SIAGT. Specifically, we propose to construct resource-saving attacks by exploiting finite discrete points of continuous representation. In addition, we formulate a coordinate-dependent loss to enhance the cross-model transferability of the attack. The attack can significantly deteriorate the SR images while introducing imperceptible distortion to the targeted low-resolution (LR) images. Experiments carried out on three popular LIIF-based SR approaches and four classical SR datasets show remarkable attack performance and transferability of SIAGT.
Abstract:Current fine-grained classification research mainly concentrates on fine-grained feature learning, but in real-world applications, the bigger issue often lies in the data. Fine-grained data annotation is challenging, and the features and semantics are highly diverse and frequently changing, making traditional methods less effective in real-world scenarios. Although some studies have provided potential solutions to this issue, most are limited to making use of limited supervised information. In this paper, we propose a novel learning paradigm to break barriers in fine-grained classification. It enables the model to learn beyond the standard training phase and benefit from cost-free data encountered during system operation. On this basis, an efficient EXPloring and EXPloiting strategy and method (EXP2) is designed. Thereinto, before the final classification results are obtained, representative inference data samples are explored according to class templates and exploited to optimize classifiers. Experimental results demonstrate the general effectiveness of EXP2.
Abstract:Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results due to their ineffective utilization of 2D diffusion priors. The limitations manifest in two critical aspects: the inadequate capture of geometric information by pretrained diffusion models and the suboptimal guidance provided by existing Score Distillation Sampling (SDS) methods. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We propose a technique called Balanced Score Distillation (BSD) that surpasses existing methods such as Score Distillation (SDS) and the improved version, Conditional Score Distillation (CSD). BSD offers improved inpainting quality in appearance and geometric aspects. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.
Abstract:Cognitive diagnosis (CD) utilizes students' existing studying records to estimate their mastery of unknown knowledge concepts, which is vital for evaluating their learning abilities. Accurate CD is extremely challenging because CD is associated with complex relationships and mechanisms among students, knowledge concepts, studying records, etc. However, existing approaches loosely consider these relationships and mechanisms by a non-end-to-end learning framework, resulting in sub-optimal feature extractions and fusions for CD. Different from them, this paper innovatively proposes an End-to-end Graph Neural Networks-based Cognitive Diagnosis (EGNN-CD) model. EGNN-CD consists of three main parts: knowledge concept network (KCN), graph neural networks-based feature extraction (GNNFE), and cognitive ability prediction (CAP). First, KCN constructs CD-related interaction by comprehensively extracting physical information from students, exercises, and knowledge concepts. Second, a four-channel GNNFE is designed to extract high-order and individual features from the constructed KCN. Finally, CAP employs a multi-layer perceptron to fuse the extracted features to predict students' learning abilities in an end-to-end learning way. With such designs, the feature extractions and fusions are guaranteed to be comprehensive and optimal for CD. Extensive experiments on three real datasets demonstrate that our EGNN-CD achieves significantly higher accuracy than state-of-the-art models in CD.