Akio Hayakawa

Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Jun 26, 2025

Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
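
Illustrative sketch (not taken from the paper): the abstract does not give the guidance formula, so the code below assumes the common concept-negation form from compositional diffusion work, in which a positive condition pushes the prediction toward the target text and a negative condition pushes it away from audio that has already been generated. The function name, the three prediction inputs, and the weights `w_text` and `w_neg` are all hypothetical.

```python
def guided_prediction(eps_video, eps_text, eps_prev, w_text=4.0, w_neg=1.5):
    """Combine three model predictions element by element.

    eps_video: prediction conditioned on the video only (assumed baseline)
    eps_text:  prediction conditioned on the video and the target text prompt
    eps_prev:  prediction conditioned on the video and previously generated audio
    The weights are hypothetical; the paper's actual values are not in the abstract.
    """
    return [
        v + w_text * (t - v) - w_neg * (p - v)
        for v, t, p in zip(eps_video, eps_text, eps_prev)
    ]


# Toy two-dimensional predictions, chosen only to make the arithmetic visible.
combined = guided_prediction(
    eps_video=[0.0, 0.0],
    eps_text=[1.0, 0.0],
    eps_prev=[0.0, 1.0],
    w_text=2.0,
    w_neg=1.0,
)
print(combined)  # [2.0, -1.0]
```

With `w_neg=0.0` the expression reduces to ordinary classifier-free guidance toward the text prompt; the negative term is what would steer each new track away from sound events already covered by earlier tracks.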

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Dec 19, 2024

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio

* Project page: https://hkchengrex.github.io/MMAudio
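
Illustrative sketch (not MMAudio's code): the abstract names a flow matching objective without spelling it out. The example below assumes the widely used straight-line variant, where a noise sample `x0` is linearly interpolated toward a data sample `x1` and the network regresses the constant velocity `x1 - x0`. All function names are hypothetical, and the "latent" is a three-element stand-in.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the point at time t on the straight path x0 -> x1 and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and target velocities."""
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for an audio latent
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time drawn uniformly from [0, 1)


def oracle(x_t, t):
    """A predictor that already knows the true velocity, used as a sanity check."""
    return [b - a for a, b in zip(x0, x1)]


print(flow_matching_loss(oracle, x0, x1, t))  # 0.0
```

In a real model `predict_velocity` would also receive the video and text conditions; they are omitted here to keep the objective itself in view.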

Difficult for Whom? A Study of Japanese Lexical Complexity

Oct 24, 2024

Adam Nohejl, Akio Hayakawa, Yusuke Ide, Taro Watanabe

Abstract: The tasks of lexical complexity prediction (LCP) and complex word identification (CWI) commonly presuppose that difficult-to-understand words are shared by the target population. Meanwhile, personalization methods have also been proposed to adapt models to individual needs. We verify that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation. Through a second reannotation, we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary. To explore the possibilities of personalization, we compare competitive baselines trained on the group mean ratings and individual ratings in terms of performance for an individual. We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult. We also experiment with adapting a finetuned BERT model, which results only in marginal improvements across all settings.

* Accepted to TSAR 2024

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

Sep 26, 2024

Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first is timestep adjustment, which provides different timestep information to each base model. It is designed to align how samples are generated over timesteps across modalities. The second is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represented temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods.

* The source code will be released soon

Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

May 28, 2024

Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract: In this study, we aim to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides each single-modal model to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models to match the score of joint distribution over audio and video. We theoretically show that this guidance can be computed through the gradient of the optimal discriminator distinguishing real audio-video pairs from fake ones independently generated by the base models. On the basis of this analysis, we construct the joint guidance module by training this discriminator. Additionally, we adopt a loss function to make the gradient of the discriminator work as a noise estimator, as in standard diffusion models, stabilizing the gradient of the discriminator. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multi-modal alignment with a relatively small number of parameters.
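
Illustrative sketch (a toy check, not the paper's method): the claim that guidance can be read off the optimal discriminator rests on a standard identity. If D*(x) = p_real(x) / (p_real(x) + p_fake(x)), then log(D* / (1 - D*)) = log p_real - log p_fake, so the gradient of that logit is exactly the correction that turns the score of p_fake into the score of p_real. The example below verifies this numerically for two one-dimensional unit-variance Gaussians; the densities, names, and finite-difference step are assumptions made for the demonstration, whereas the paper trains a discriminator on noisy audio-video pairs.

```python
import math


def gaussian_pdf(x, mean):
    """Density of a unit-variance Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def optimal_discriminator(x, mean_real, mean_fake):
    """Closed-form optimal discriminator between the two toy densities."""
    p_real = gaussian_pdf(x, mean_real)
    p_fake = gaussian_pdf(x, mean_fake)
    return p_real / (p_real + p_fake)


def logit_gradient(x, mean_real, mean_fake, h=1e-5):
    """Central finite difference of log(D / (1 - D)) with respect to x."""
    def logit(z):
        d = optimal_discriminator(z, mean_real, mean_fake)
        return math.log(d) - math.log(1.0 - d)
    return (logit(x + h) - logit(x - h)) / (2.0 * h)


x, mean_real, mean_fake = 0.3, 1.0, 0.0
score_fake = -(x - mean_fake)   # analytic score of the "fake" Gaussian
score_real = -(x - mean_real)   # analytic score of the "real" Gaussian

# Fake score plus the discriminator-logit gradient should recover the real score.
corrected = score_fake + logit_gradient(x, mean_real, mean_fake)
print(abs(corrected - score_real) < 1e-6)  # True
```

In the paper's setting the "fake" distribution is the product of the two base models' outputs and the "real" one is the joint audio-video distribution, so the same identity lets a lightweight discriminator supply the missing cross-modal term.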

Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion

Mar 28, 2023

Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, Takuya Narihira

Abstract: We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D. Our method is designed for a novel task, which is to convert a given 3D scene to another scene according to text instructions. Instruct 3D-to-3D applies pretrained Image-to-Image diffusion models for 3D-to-3D conversion. This enables the likelihood maximization of each viewpoint image and high-quality 3D generation. In addition, our proposed method explicitly inputs the source 3D scene as a condition, which enhances 3D consistency and controllability of how much of the source 3D scene structure is reflected. We also propose dynamic scaling, which allows the intensity of the geometry transformation to be adjusted. We performed quantitative and qualitative evaluations and showed that our proposed method achieves higher quality 3D-to-3D conversions than baseline methods.

* Project page: https://sony.github.io/Instruct3Dto3D-doc/

Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models

Dec 08, 2022

Naoki Matsunaga, Masato Ishii, Akio Hayakawa, Kenji Suzuki, Takuya Narihira

Abstract: Generative models, particularly GANs, have been utilized for image editing. Although GAN-based methods perform well at generating reasonable content aligned with the user's intentions, they struggle to strictly preserve the content outside the editing region. To address this issue, we use diffusion models instead of GANs and propose a novel image-editing method based on pixel-wise guidance. Specifically, we first train pixel-classifiers with a small amount of annotated data and then estimate the semantic segmentation map of a target image. Users then manipulate the map to instruct how the image is to be edited. The diffusion model generates an edited image via guidance by pixel-wise classifiers, such that the resultant image aligns with the manipulated map. As the guidance is conducted pixel-wise, the proposed method can create reasonable content in the editing region while preserving the content outside this region. The experimental results validate the advantages of the proposed method both quantitatively and qualitatively.

* 21 pages, 19 figures, fixed figure bugs

Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives

Feb 12, 2021

Akio Hayakawa, Masato Ishii, Yoshiyuki Kobayashi, Akira Nakamura, Takuya Narihira, Yukio Obuchi, Andrew Shin, Takuya Yashima, Kazuki Yoshiyama

Abstract: While there exists a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, fast computation in distributed settings, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineers' perspectives, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.

* https://nnabla.org

Reference-Based Video Colorization with Spatiotemporal Correspondence

Nov 25, 2020

Naofumi Akimoto, Akio Hayakawa, Andrew Shin, Takuya Narihira

Abstract: We propose a novel reference-based video colorization framework with spatiotemporal correspondence. Reference-based methods colorize grayscale frames by referencing a user-supplied color frame. Existing methods suffer from color leakage between objects and the emergence of average colors, both derived from non-local semantic correspondence in space. To address this issue, we warp colors only from the regions on the reference frame restricted by correspondence in time. We propagate masks as temporal correspondences, using two complementary tracking approaches: off-the-shelf instance tracking for high-performance segmentation, and newly proposed dense tracking to track various types of objects. By restricting temporally-related regions for referencing colors, our approach propagates faithful colors throughout the video. Experiments demonstrate that our method outperforms state-of-the-art methods quantitatively and qualitatively.

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

Oct 27, 2020

Akio Hayakawa, Takuya Narihira

Abstract: While large neural networks demonstrate higher performance in various tasks, training large networks is difficult due to limitations on GPU memory size. We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than allotted GPU memory. Under a given memory budget constraint, our scheduling algorithm locally adapts the timing of memory transfers according to the memory usage of each function, which improves overlap between computation and memory transfers. Additionally, we apply a virtual addressing technique, commonly used in operating systems, to the training of neural networks with out-of-core execution, which drastically reduces the amount of memory fragmentation caused by frequent memory transfers. With our proposed algorithm, we successfully train ResNet-50 with a batch size of 1440 while keeping training speed at 55%, which is 7.5x larger than the upper bound of physical memory. It also substantially outperforms the previous state of the art, i.e., it trains a 1.55x larger network than the state of the art with faster execution. Moreover, we experimentally show that our approach is also scalable for various types of networks.
