Abstract:Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
Abstract:Anomaly detection (AD) has attracted considerable attention in both academia and industry. Due to the lack of anomalous data in many practical cases, AD is usually solved by first modeling the normal data pattern and then determining if data fit this model. Generative models (GMs) seem a natural tool to achieve this purpose, which learn the normal data distribution and estimate it using a probability density function (PDF). However, some works have observed the ideal performance of such GM-based AD methods. In this paper, we propose a new perspective on the ideal performance of GM-based AD methods. We state that in these methods, the implicit assumption that connects GMs'results to AD's goal is usually implausible due to normal data's multi-peaked distribution characteristic, which is quite common in practical cases. We first qualitatively formulate this perspective, and then focus on the Gaussian mixture model (GMM) to intuitively illustrate the perspective, which is a typical GM and has the natural property to approximate multi-peaked distributions. Based on the proposed perspective, in order to bypass the implicit assumption in the GMM-based AD method, we suggest integrating the Discriminative idea to orient GMM to AD tasks (DiGMM). With DiGMM, we establish a connection of generative and discriminative models, which are two key paradigms for AD and are usually treated separately before. This connection provides a possible direction for future works to jointly consider the two paradigms and incorporate their complementary characteristics for AD.