Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available across diverse scenes, requires producing high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment between the audio and visual modalities during sound generation. To this end, it builds a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with the corresponding audio features at multiple stages. The second module adopts a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, we present an industry-standard video-to-audio (V2A) dataset covering various real-world scenarios. Automated evaluations and human studies show that YingSound effectively generates high-quality, synchronized sounds across diverse conditional inputs. Project Page: \url{https://giantailab.github.io/yingsound/}
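The abstract does not specify the internals of the audio-visual aggregator, but a common way to fuse high-resolution visual features into an audio stream is cross-attention with a residual connection. The sketch below is a minimal, hypothetical illustration of such a fusion block; all module and argument names are illustrative assumptions, not the YingSound implementation.

```python
# Hypothetical audio-visual aggregator (AVA) block: audio tokens attend to
# projected visual tokens via cross-attention at one transformer stage.
# Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualAggregator(nn.Module):
    def __init__(self, audio_dim: int = 512, visual_dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, audio_dim)   # map visual features into the audio space
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (B, T_audio, audio_dim); visual_tokens: (B, T_video, visual_dim)
        v = self.visual_proj(visual_tokens)
        fused, _ = self.cross_attn(query=audio_tokens, key=v, value=v)
        return self.norm(audio_tokens + fused)                # residual fusion

if __name__ == "__main__":
    ava = AudioVisualAggregator()
    audio = torch.randn(2, 100, 512)    # e.g. latent audio frames
    visual = torch.randn(2, 32, 768)    # e.g. per-frame video embeddings
    print(ava(audio, visual).shape)     # torch.Size([2, 100, 512])
```

In practice such a block could be inserted at several stages of the flow matching transformer so that visual evidence conditions the audio representation at multiple resolutions.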
Abstract: Recent advances in text-to-speech (TTS) have significantly improved the expressiveness of synthetic speech. However, a major challenge remains: generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space that covers diverse styles via contrastive learning under the supervision of speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information together with the style embedding obtained from text. Finally, we integrate the context encoder into two typical TTS models: VITS-based TTS and language-model-based TTS. Experimental results demonstrate that our approach effectively captures diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
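The abstract describes training a text-aware style space contrastively against speech-style supervision. A standard way to realize this is a symmetric InfoNCE-style objective that pulls together text and speech style embeddings of the same utterance. The sketch below is a minimal illustration under that assumption; the function name, temperature, and loss form are not taken from the TACA paper.

```python
# Minimal sketch of a contrastive objective aligning text-derived style
# embeddings with speech-style embeddings (InfoNCE/CLIP-style). All names
# and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_style_loss(text_style: torch.Tensor,
                           speech_style: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # text_style, speech_style: (B, D) embeddings for the same B utterances
    t = F.normalize(text_style, dim=-1)
    s = F.normalize(speech_style, dim=-1)
    logits = t @ s.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)     # matching pairs on the diagonal
    # symmetric cross-entropy over text->speech and speech->text directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    text_emb = torch.randn(8, 256)
    speech_emb = torch.randn(8, 256)
    print(contrastive_style_loss(text_emb, speech_emb).item())
```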
Abstract: In this paper, we propose a new influence spread model, namely the Complementary \& Competitive Independent Cascade (C$^2$IC) model. The C$^2$IC model generalizes three well-known influence models, i.e., the influence boosting (IB) model, the campaign-oblivious IC (COIC) model, and the IC-N model (IC model with negative opinions). It is the first model to comprehensively consider both complementary and competitive influence spread in a multi-agent environment. Correspondingly, we propose the Complementary \& Competitive influence maximization (C$^2$IM) problem. Given an ally seed set and a rival seed set, the C$^2$IM problem aims to select a set of assistant nodes that boost the ally spread and block the rival spread concurrently. We show that the problem is NP-hard and generalizes both the influence boosting problem and the influence blocking problem. By classifying the different cascade priorities into four cases according to the conditions under which monotonicity and submodularity (M\&S) hold, we design four corresponding algorithms with theoretical approximation bounds. Extensive experiments on real social networks demonstrate the effectiveness of the proposed algorithms. We hope this work inspires further exploration of more general influence models and helps streamline research in this area.
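For the cases where the objective is monotone and submodular, the standard approach is the greedy algorithm that repeatedly adds the assistant node with the largest marginal gain, achieving the classic (1 - 1/e) approximation. The skeleton below illustrates that pattern; the objective oracle stands in for a Monte-Carlo estimate of the boosted ally spread minus the rival spread and is an assumption, not the paper's actual estimator.

```python
# Hypothetical greedy skeleton for assistant-node selection under a monotone
# submodular objective. `estimate_objective` is an assumed stand-in for a
# Monte-Carlo spread estimator; it is not the paper's implementation.
from typing import Callable, Hashable, Iterable, Set

Node = Hashable

def greedy_assistant_selection(candidates: Iterable[Node],
                               budget: int,
                               estimate_objective: Callable[[Set[Node]], float]) -> Set[Node]:
    selected: Set[Node] = set()
    remaining = set(candidates)
    base = estimate_objective(selected)
    for _ in range(budget):
        best_node, best_gain = None, 0.0
        for v in remaining:
            gain = estimate_objective(selected | {v}) - base   # marginal gain of adding v
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:          # no candidate improves the objective; stop early
            break
        selected.add(best_node)
        remaining.remove(best_node)
        base += best_gain
    return selected

if __name__ == "__main__":
    # toy objective with diminishing returns, used only to exercise the skeleton
    toy = lambda S: len(S) ** 0.5
    print(greedy_assistant_selection(range(10), budget=3, estimate_objective=toy))
```

For the cases where M\&S do not hold, the paper's other algorithms would replace this plain greedy loop, but the selection interface stays the same.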