Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hyunwoo Ha

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Dec 09, 2024

Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh

Figure 1 for Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Figure 2 for Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Figure 3 for Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Figure 4 for Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Abstract:How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, showing its generalizability by integrating various model architectures and different types of audio-visual data.

* Under-review

Via

Access Paper or Ask Questions

MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation

Aug 21, 2024

Kim Yu-Ji, Hyunwoo Ha, Kim Youwang, Jaeheung Surh, Hyowon Ha, Tae-Hyun Oh

Abstract:Reconstructing 3D from a single view image is a long-standing challenge. One of the popular approaches to tackle this problem is learning-based methods, but dealing with the test cases unfamiliar with training data (Out-of-distribution; OoD) introduces an additional challenge. To adapt for unseen samples in test time, we propose MeTTA, a test-time adaptation (TTA) exploiting generative prior. We design joint optimization of 3D geometry, appearance, and pose to handle OoD cases with only a single view image. However, the alignment between the reference image and the 3D shape via the estimated viewpoint could be erroneous, which leads to ambiguity. To address this ambiguity, we carefully design learnable virtual cameras and their self-calibration. In our experiments, we demonstrate that MeTTA effectively deals with OoD scenarios at failure cases of existing learning-based 3D reconstruction models and enables obtaining a realistic appearance with physically based rendering (PBR) textures.

* Accepted at BMVC 2024. [Project page] https://metta3d.github.io/

Via

Access Paper or Ask Questions

Revisiting Learning-based Video Motion Magnification for Real-time Processing

Mar 04, 2024

Hyunwoo Ha, Oh Hyun-Bin, Kim Jun-Seong, Kwon Byung-Ki, Kim Sung-Bin, Linh-Tam Tran, Ji-Yun Kim, Sung-Ho Bae, Tae-Hyun Oh

Figure 1 for Revisiting Learning-based Video Motion Magnification for Real-time Processing

Figure 2 for Revisiting Learning-based Video Motion Magnification for Real-time Processing

Figure 3 for Revisiting Learning-based Video Motion Magnification for Real-time Processing

Figure 4 for Revisiting Learning-based Video Motion Magnification for Real-time Processing

Abstract:Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully demonstrates the modelling of the motion magnification problem with outstanding quality compared to conventional signal processing-based ones. However, it still lags behind real-time performance, which prevents it from being extended to various online applications. In this paper, we investigate an efficient deep learning-based motion magnification model that runs in real time for full-HD resolution videos. Due to the specified network design of the prior art, i.e. inhomogeneous architecture, the direct application of existing neural architecture search methods is complicated. Instead of automatic search, we carefully investigate the architecture module by module for its role and importance in the motion magnification task. Two key findings are 1) Reducing the spatial resolution of the latent motion representation in the decoder provides a good trade-off between computational efficiency and task quality, and 2) surprisingly, only a single linear layer and a single branch in the encoder are sufficient for the motion magnification task. Based on these findings, we introduce a real-time deep learning-based motion magnification model with4.2X fewer FLOPs and is 2.7X faster than the prior art while maintaining comparable quality.

* 19 pages

Via

Access Paper or Ask Questions

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

Mar 30, 2023

Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

Abstract:How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space.

* CVPR 2023

Via

Access Paper or Ask Questions