Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:RealCustom++: Representing Images as Real-Word for Real-Time Customization

Aug 19, 2024

Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, Xiaojun Chang, Yongdong Zhang

Figure 1 for RealCustom++: Representing Images as Real-Word for Real-Time Customization

Figure 2 for RealCustom++: Representing Images as Real-Word for Real-Time Customization

Figure 3 for RealCustom++: Representing Images as Real-Word for Real-Time Customization

Figure 4 for RealCustom++: Representing Images as Real-Word for Real-Time Customization

Share this with someone who'll enjoy it:

Abstract:Text-to-image customization, which takes given texts and images depicting given subjects as inputs, aims to synthesize new images that align with both text semantics and subject appearance. This task provides precise control over details that text alone cannot capture and is fundamental for various real-world applications, garnering significant interest from academia and industry. Existing works follow the pseudo-word paradigm, which involves representing given subjects as pseudo-words and combining them with given texts to collectively guide the generation. However, the inherent conflict and entanglement between the pseudo-words and texts result in a dual-optimum paradox, where subject similarity and text controllability cannot be optimal simultaneously. We propose a novel real-words paradigm termed RealCustom++ that instead represents subjects as non-conflict real words, thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously. Specifically, RealCustom++ introduces a novel "train-inference" decoupled framework: (1) During training, RealCustom++ learns the alignment between vision conditions and all real words in the text, ensuring high subject-similarity generation in open domains. This is achieved by the cross-layer cross-scale projector to robustly and finely extract subject features, and a curriculum training recipe that adapts the generated subject to diverse poses and sizes. (2) During inference, leveraging the learned general alignment, an adaptive mask guidance is proposed to only customize the generation of the specific target real word, keeping other subject-irrelevant regions uncontaminated to ensure high text-controllability in real-time.

* 23 pages

View paper on

Share this with someone who'll enjoy it:

Title:RealCustom++: Representing Images as Real-Word for Real-Time Customization

Paper and Code