Abstract: Learning a mapping between two unrelated domains, such as image and audio, without any supervision is a challenging task. In this work, we propose a distance-preserving generative adversarial model that translates images of human faces into an audio domain. The audio domain is defined by a collection of musical note sounds recorded from 10 different instrument families (NSynth \cite{nsynth2017}) and a distance metric that incorporates instrument family class information together with mel-frequency cepstral coefficient (MFCC) features. To enforce distance preservation, we use a loss term that penalizes the difference between the pairwise distances of the faces and those of the translated audio samples. We further find that this distance-preservation constraint reduces the diversity of the translated audio samples, and we propose an auxiliary discriminator that enhances the diversity of the translations while the constraint remains in effect. We also provide a visual demonstration of the results and a numerical analysis of the fidelity of the translations. A video demo of the translations learned by our model is available at https://www.dropbox.com/s/the176w9obq8465/face_to_musical_note.mov?dl=0.
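For concreteness, the distance-preservation penalty described above can be sketched in one possible formulation (the notation $d_X$, $d_Y$, and $G$ is ours, introduced only for illustration) as a term that compares pairwise distances before and after translation:
\[
\mathcal{L}_{\text{dist}} = \mathbb{E}_{x_i, x_j}\Big[\,\big|\, d_X(x_i, x_j) - d_Y\big(G(x_i), G(x_j)\big) \,\big|\,\Big],
\]
where $d_X$ denotes the distance between two face images (or their embeddings), $d_Y$ is the audio-domain metric combining instrument family class information with MFCC features, and $G$ is the generator performing the translation.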
Abstract: As the adage "a picture is worth a thousand words" suggests, conveying visual information through spoken language is rarely brief. This work describes a novel technique that leverages machine-learned feature embeddings to translate visual (and other types of) information into a perceptual audio domain, allowing users to perceive this information using only their aural faculty. The system uses a pretrained image embedding network to extract visual features and embed them in a compact subset of Euclidean space; this converts the images into feature vectors whose $L^2$ distances serve as a meaningful measure of similarity. A generative adversarial network (GAN) is then used to find a distance-preserving map from this metric space of feature vectors into the metric space defined by a target audio dataset equipped with either the Euclidean metric or a mel-frequency cepstrum-based psychoacoustic distance metric. We demonstrate the technique by translating images of faces into human speech-like audio. For both target audio metrics, the GAN successfully finds a distance-preserving mapping, and in human-subject tests, users were able to accurately classify the audio translations of faces.
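As a rough illustration of a mel-frequency cepstrum-based audio distance of the kind mentioned above, the following Python sketch computes an MFCC-based $L^2$ distance between two clips. It assumes the librosa library; summarizing each clip by a time-averaged MFCC vector and using 20 coefficients are our assumptions for illustration, not details taken from the text.

```python
# Illustrative sketch of an MFCC-based distance between two audio clips.
# Assumptions: librosa is available; each clip is summarized by a
# time-averaged MFCC vector with 20 coefficients.
import numpy as np
import librosa

def mfcc_embedding(path, sr=16000, n_mfcc=20):
    """Load an audio clip and summarize it as a time-averaged MFCC vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # shape (n_mfcc,)

def mfcc_distance(path_a, path_b):
    """Euclidean (L2) distance between the MFCC summaries of two clips."""
    return float(np.linalg.norm(mfcc_embedding(path_a) - mfcc_embedding(path_b)))

# Example usage (hypothetical file names):
# d = mfcc_distance("clip_a.wav", "clip_b.wav")
```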