Abstract:Generating images from human sketches typically requires dedicated networks trained from scratch. In contrast, the emergence of the pre-trained Vision-Language models (e.g., CLIP) has propelled generative applications based on controlling the output imagery of existing StyleGAN models with text inputs or reference images. Parallelly, our work proposes a framework to control StyleGAN imagery with a single user sketch. In particular, we learn a conditional distribution in the latent space of a pre-trained StyleGAN model via energy-based learning and propose two novel energy functions leveraging CLIP for cross-domain semantic supervision. Once trained, our model can generate multi-modal images semantically aligned with the input sketch. Quantitative evaluations on synthesized datasets have shown that our approach improves significantly from previous methods in the one-shot regime. The superiority of our method is further underscored when experimenting with a wide range of human sketches of diverse styles and poses. Surprisingly, our models outperform the previous baseline regarding both the range of sketch inputs and image qualities despite operating with a stricter setting: with no extra training data and single sketch input.