Abstract:Remarkable progress has been achieved in image generation with the introduction of generative models. However, precisely controlling the content in generated images remains a challenging task due to their fundamental training objective. This paper addresses this challenge by proposing a novel image generation framework explicitly designed to incorporate desired content in output images. The framework utilizes advanced encoding techniques, integrating subnetworks called content fusion and frequency encoding modules. The frequency encoding module first captures features and structures of reference images by exclusively focusing on selected frequency components. Subsequently, the content fusion module generates a content-guiding vector that encapsulates desired content features. During the image generation process, content-guiding vectors from real images are fused with projected noise vectors. This ensures the production of generated images that not only maintain consistent content from guiding images but also exhibit diverse stylistic variations. To validate the effectiveness of the proposed framework in preserving content attributes, extensive experiments are conducted on widely used benchmark datasets, including Flickr-Faces-High Quality, Animal Faces High Quality, and Large-scale Scene Understanding datasets.