Abstract: With the development of convolutional neural networks, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into the style transfer results. To solve this problem, we propose a novel STyle TRansformer (STTR) network, which breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first propose to use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens to encourage fine-grained style transformations. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT), involving 50 human subjects and 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results.
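To make the token-wise style transformation concrete, below is a minimal PyTorch sketch (our illustration, not the authors' released code) of the cross-attention step: content tokens act as queries and style tokens as keys and values, so each content token aggregates the style patterns most similar to it. The single-head design, token dimensions, and residual connection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContentStyleCrossAttention(nn.Module):
    """Sketch of cross-attention from content tokens to style tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from content tokens
        self.k = nn.Linear(dim, dim)   # keys from style tokens
        self.v = nn.Linear(dim, dim)   # values from style tokens
        self.scale = dim ** -0.5

    def forward(self, content_tokens, style_tokens):
        # content_tokens: (B, Nc, dim), style_tokens: (B, Ns, dim)
        q = self.q(content_tokens)
        k = self.k(style_tokens)
        v = self.v(style_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, Nc, Ns)
        return content_tokens + attn @ v  # stylized content tokens

if __name__ == "__main__":
    content = torch.randn(1, 196, 256)  # e.g. 14x14 grid of content patches
    style = torch.randn(1, 196, 256)
    out = ContentStyleCrossAttention(256)(content, style)
    print(out.shape)  # torch.Size([1, 196, 256])
```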
Abstract: We present a new perspective on image synthesis by viewing the task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables flexible local manipulation of different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i.e., TokenGAN). In particular, the TokenGAN takes as input two semantically different types of visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens through an attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN achieves state-of-the-art results on several widely used image synthesis benchmarks, including FFHQ and LSUN CHURCH at different resolutions. In particular, the generator is able to synthesize high-fidelity images at 1024x1024 resolution, dispensing with convolutions entirely.
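The core mechanism described in this abstract, assigning styles from latent-derived style tokens to learned constant content tokens via attention, can be sketched roughly as follows in PyTorch. The shapes, the generic multi-head attention layer, and the `to_style` mapping are illustrative assumptions, not the official TokenGAN implementation.

```python
import torch
import torch.nn as nn

class TokenStyleBlock(nn.Module):
    """Sketch of one style-assignment step: content tokens attend to style tokens."""
    def __init__(self, dim, n_content_tokens):
        super().__init__()
        # learned constant content tokens, shared across all generated images
        self.content_tokens = nn.Parameter(torch.randn(1, n_content_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_style = nn.Linear(dim, dim)  # maps latent codes to style tokens

    def forward(self, latents):
        # latents: (B, n_style_tokens, dim) sampled from the latent space
        style_tokens = self.to_style(latents)
        content = self.content_tokens.expand(latents.size(0), -1, -1)
        # each content token gathers style from the style tokens it attends to
        out, _ = self.attn(content, style_tokens, style_tokens)
        return content + out

if __name__ == "__main__":
    block = TokenStyleBlock(dim=256, n_content_tokens=64)
    z = torch.randn(2, 16, 256)
    print(block(z).shape)  # torch.Size([2, 64, 256])
```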
Abstract: We propose the Image-Semantic-Transformation-Reconstruction-Circle (ISTRC) model, a novel and powerful method that uses FaceNet's Euclidean latent space to understand images. As the name suggests, ISTRC constructs a circle that is able to perfectly reconstruct images. One powerful Euclidean latent space embedded in ISTRC is FaceNet's last layer, with the power of distinguishing and understanding images. Our model reconstructs the images and manipulates Euclidean latent vectors to achieve semantic transformations and semantic image arithmetic. In this paper, we show that ISTRC performs 10 high-level semantic transformations, such as "male and female", "add smile", "open mouth", "deduct beard or add mustache", "bigger/smaller nose", "make older and younger", "bigger lips", "bigger eyes", "bigger/smaller mouths", and "more attractive". It takes just 3 hours (on a GTX 1080) to train the models for all 10 semantic transformations.
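The semantic transformations described here amount to arithmetic on latent vectors: encode an image into the Euclidean latent space, shift the vector along a learned semantic direction, and decode the result. A minimal PyTorch sketch of that idea follows; the `encoder`, `decoder`, and `smile_direction` names are hypothetical stand-ins for the components the paper actually trains (FaceNet's last layer and the reconstruction network), so the real models would replace them.

```python
import torch

def semantic_transform(image, encoder, decoder, direction, strength=1.0):
    """Apply one semantic edit by moving the latent vector along `direction`."""
    z = encoder(image)                  # Euclidean latent vector, e.g. 128-d
    z_edit = z + strength * direction   # latent arithmetic for the semantic edit
    return decoder(z_edit)              # reconstruct the edited image

if __name__ == "__main__":
    # Toy encoder/decoder so the sketch runs without the real FaceNet weights.
    encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
    decoder = torch.nn.Sequential(torch.nn.Linear(128, 3 * 64 * 64),
                                  torch.nn.Unflatten(1, (3, 64, 64)))
    smile_direction = torch.randn(128)  # hypothetical learned "add smile" direction
    img = torch.rand(1, 3, 64, 64)
    out = semantic_transform(img, encoder, decoder, smile_direction, strength=0.5)
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```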