Abstract:Recent text-to-video advancements have enabled coherent video synthesis from prompts and expanded to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference due to feature domain mismatch caused by naive decoupled optimizations or exhibit appearance contamination induced by spatial feature leakage resulting from the entanglement of motion and appearance in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation lies in adding frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution, thereby disrupting spatial attributes associated with frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation for this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.
Abstract:We introduced "Presidifussion," a novel approach to learning and replicating the unique style of calligraphy of President Xu, using a pretrained diffusion model adapted through a two-stage training process. Initially, our model is pretrained on a diverse dataset containing works from various calligraphers. This is followed by fine-tuning on a smaller, specialized dataset of President Xu's calligraphy, comprising just under 200 images. Our method introduces innovative techniques of font image conditioning and stroke information conditioning, enabling the model to capture the intricate structural elements of Chinese characters. The effectiveness of our approach is demonstrated through a comparison with traditional methods like zi2zi and CalliGAN, with our model achieving comparable performance using significantly smaller datasets and reduced computational resources. This work not only presents a breakthrough in the digital preservation of calligraphic art but also sets a new standard for data-efficient generative modeling in the domain of cultural heritage digitization.