Abstract:Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user requirements, particularly when the objective is to generate music that embodies a specific concept derived from a designated reference collection. In this paper, we propose a novel method for customized text-to-music generation, which can capture the concept from a two-minute reference music and generate a new piece of music conforming to the concept. We achieve this by fine-tuning a pretrained text-to-music model using the reference music. However, directly fine-tuning all parameters leads to overfitting issues. To address this problem, we propose a Pivotal Parameters Tuning method that enables the model to assimilate the new concept while preserving its original generative capabilities. Additionally, we identify a potential concept conflict when introducing multiple concepts into the pretrained model. We present a concept enhancement strategy to distinguish multiple concepts, enabling the fine-tuned model to generate music incorporating either individual or multiple concepts simultaneously. Since we are the first to work on the customized music generation task, we also introduce a new dataset and evaluation protocol for the new task. Our proposed Jen1-DreamStyler outperforms several baselines in both qualitative and quantitative evaluations. Demos will be available at https://www.jenmusic.ai/research#DreamStyler.
Abstract:With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation from scratch. However, finer-grained control over multi-track generation remains an open challenge. Existing models exhibit strong raw generation capability but lack the flexibility to compose separate tracks and combine them in a controllable manner, differing from typical workflows of human composers. To address this issue, we propose JEN-1 Composer, a unified framework to efficiently model marginal, conditional, and joint distributions over multi-track music via a single model. JEN-1 Composer framework exhibits the capacity to seamlessly incorporate any diffusion-based music generation system, \textit{e.g.} Jen-1, enhancing its capacity for versatile multi-track music generation. We introduce a curriculum training strategy aimed at incrementally instructing the model in the transition from single-track generation to the flexible generation of multi-track combinations. During the inference, users have the ability to iteratively produce and choose music tracks that meet their preferences, subsequently creating an entire musical composition incrementally following the proposed Human-AI co-composition workflow. Quantitative and qualitative assessments demonstrate state-of-the-art performance in controllable and high-fidelity multi-track music synthesis. The proposed JEN-1 Composer represents a significant advance toward interactive AI-facilitated music creation and composition. Demos will be available at https://www.jenmusic.ai/audio-demos.
Abstract:Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. They impose significant challenges on building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework by incorporating multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Considering the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment real-sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/.
Abstract:Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures and high sampling rate requirements. Despite the task's significance, prevailing generative models exhibit limitations in music quality, computational efficiency, and generalization. This paper introduces JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a diffusion model incorporating both autoregressive and non-autoregressive training. Through in-context learning, JEN-1 performs various generation tasks including text-guided music generation, music inpainting, and continuation. Evaluations demonstrate JEN-1's superior performance over state-of-the-art methods in text-music alignment and music quality while maintaining computational efficiency. Our demos are available at http://futureverse.com/research/jen/demos/jen1
Abstract:The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantics correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
Abstract:We study the backward compatible problem for person re-identification (Re-ID), which aims to constrain the features of an updated new model to be comparable with the existing features from the old model in galleries. Most of the existing works adopt distillation-based methods, which focus on pushing new features to imitate the distribution of the old ones. However, the distillation-based methods are intrinsically sub-optimal since it forces the new feature space to imitate the inferior old feature space. To address this issue, we propose the Ranking-based Backward Compatible Learning (RBCL), which directly optimizes the ranking metric between new features and old features. Different from previous methods, RBCL only pushes the new features to find best-ranking positions in the old feature space instead of strictly alignment, and is in line with the ultimate goal of backward retrieval. However, the sharp sigmoid function used to make the ranking metric differentiable also incurs the gradient vanish issue, therefore stems the ranking refinement during the later period of training. To address this issue, we propose the Dynamic Gradient Reactivation (DGR), which can reactivate the suppressed gradients by adding dynamic computed constant during forward step. To further help targeting the best-ranking positions, we include the Neighbor Context Agents (NCAs) to approximate the entire old feature space during training. Unlike previous works which only test on the in-domain settings, we make the first attempt to introduce the cross-domain settings (including both supervised and unsupervised), which are more meaningful and difficult. The experimental results on all five settings show that the proposed RBCL outperforms previous state-of-the-art methods by large margins under all settings.
Abstract:The fashion industry has diverse applications in multi-modal image generation and editing. It aims to create a desired high-fidelity image with the multi-modal conditional signal as guidance. Most existing methods learn different condition guidance controls by introducing extra models or ignoring the style prior knowledge, which is difficult to handle multiple signal combinations and faces a low-fidelity problem. In this paper, we adapt both style prior knowledge and flexibility of multi-modal control into one unified two-stage framework, M6-Fashion, focusing on the practical AI-aided Fashion design. It decouples style codes in both spatial and semantic dimensions to guarantee high-fidelity image generation in the first stage. M6-Fashion utilizes self-correction for the non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls. Extensive experiments on a large-scale clothing dataset M2C-Fashion demonstrate superior performances on various image generation and editing tasks. M6-Fashion model serves as a highly potential AI designer for the fashion industry.
Abstract:In this paper, we focus on the unsupervised learning for Video Object Segmentation (VOS) which learns visual correspondence (i.e., similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimize either in image level or pixel level. Image-level optimization (e.g., the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal since the pixel-level features are optimized implicitly. By contrast, pixel-level optimization is more explicit, however, it is sensitive to the visual quality of training data and is not robust to object deformation. To complementarily perform these two levels of optimization in a unified framework, we propose the In-aNd-Out (INO) generative learning from a purely generative perspective with the help of naturally designed class tokens and patch tokens in Vision Transformer (ViT). Specifically, for image-level optimization, we force the out-view imagination from local to global views on class tokens, which helps capturing high-level semantics, and we name it as out-generative learning. As to pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image via inferring its fine-grained structure, and we term it as in-generative learning. To better discover the temporal information, we additionally force the inter-frame consistency from both feature level and affinity matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous state-of-the-art methods by significant margins.
Abstract:Conventional face super-resolution methods usually assume testing low-resolution (LR) images lie in the same domain as the training ones. Due to different lighting conditions and imaging hardware, domain gaps between training and testing images inevitably occur in many real-world scenarios. Neglecting those domain gaps would lead to inferior face super-resolution (FSR) performance. However, how to transfer a trained FSR model to a target domain efficiently and effectively has not been investigated. To tackle this problem, we develop a Domain-Aware Pyramid-based Face Super-Resolution network, named DAP-FSR network. Our DAP-FSR is the first attempt to super-resolve LR faces from a target domain by exploiting only a pair of high-resolution (HR) and LR exemplar in the target domain. To be specific, our DAP-FSR firstly employs its encoder to extract the multi-scale latent representations of the input LR face. Considering only one target domain example is available, we propose to augment the target domain data by mixing the latent representations of the target domain face and source domain ones, and then feed the mixed representations to the decoder of our DAP-FSR. The decoder will generate new face images resembling the target domain image style. The generated HR faces in turn are used to optimize our decoder to reduce the domain gap. By iteratively updating the latent representations and our decoder, our DAP-FSR will be adapted to the target domain, thus achieving authentic and high-quality upsampled HR faces. Extensive experiments on three newly constructed benchmarks validate the effectiveness and superior performance of our DAP-FSR compared to the state-of-the-art.
Abstract:Labeling pixel-level masks for fine-grained semantic segmentation tasks, e.g. human parsing, remains a challenging task. The ambiguous boundary between different semantic parts and those categories with similar appearance usually are confusing, leading to unexpected noises in ground truth masks. To tackle the problem of learning with label noises, this work introduces a purification strategy, called Self-Correction for Human Parsing (SCHP), to progressively promote the reliability of the supervised labels as well as the learned models. In particular, starting from a model trained with inaccurate annotations as initialization, we design a cyclically learning scheduler to infer more reliable pseudo-masks by iteratively aggregating the current learned model with the former optimal one in an online manner. Besides, those correspondingly corrected labels can in turn to further boost the model performance. In this way, the models and the labels will reciprocally become more robust and accurate during the self-correction learning cycles. Benefiting from the superiority of SCHP, we achieve the best performance on two popular single-person human parsing benchmarks, including LIP and Pascal-Person-Part datasets. Our overall system ranks 1st in CVPR2019 LIP Challenge. Code is available at https://github.com/PeikeLi/Self-Correction-Human-Parsing.