Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

May 17, 2023

Zhenxing Zhang, Lambert Schomaker

Figure 1 for Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

Figure 2 for Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

Figure 3 for Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

Figure 4 for Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

Share this with someone who'll enjoy it:

Abstract:The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal. Recently, various studies have focused on this task and have achieved promising performance. However, current speech-to-image approaches are based on a stacked modular framework that suffers from three vital issues: 1) Training separate networks is time-consuming as well as inefficient and the convergence of the final generative model strongly depends on the previous generators; 2) The quality of precursor images is ignored by this architecture; 3) Multiple discriminator networks are required to be trained. To this end, we propose an efficient and effective single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples on the basis of given spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), constructed with a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of synthesized pictures. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at various hierarchical levels in the architecture. We conduct a series of experiments on four benchmark data sets, i.e., CUB birds, Oxford-102, Flickr8k and Places-subset. The experimental results demonstrate the superiority of the presented Fusion-S2iGan compared to the state-of-the-art models with a multi-stage architecture and a performance level that is close to traditional text-to-image approaches.

View paper on

Share this with someone who'll enjoy it:

Title:Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation

Paper and Code