Jingwei Zhao

FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Mar 26, 2025

Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang(+6 more)

Abstract: With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.

* Project page: https://fb-4d.c7w.tech/

Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement

Aug 27, 2024

Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Ye Wang

Abstract: Large language models have shown significant capabilities across various domains, including symbolic music generation. However, leveraging these pre-trained models for controllable music arrangement tasks, each requiring different forms of musical information as control, remains a novel challenge. In this paper, we propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model for multiple multi-track arrangement tasks, including band arrangement, piano reduction, drum arrangement, and voice separation. Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines across all four tasks. Furthermore, through additional experiments on probing analysis, we show that the pre-training phase equips the model with essential knowledge to understand musical conditions, which is hard to acquire solely through task-specific fine-tuning.

* Submitted to AAAI 2025

SA-GS: Scale-Adaptive Gaussian Splatting for Training-Free Anti-Aliasing

Mar 28, 2024

Xiaowei Song, Jv Zheng, Shiran Yuan, Huan-ang Gao, Jingwei Zhao, Xiang He, Weihao Gu, Hao Zhao

Abstract: In this paper, we present a Scale-adaptive method for Anti-aliasing Gaussian Splatting (SA-GS). While the state-of-the-art method Mip-Splatting requires modifying the training procedure of Gaussian splatting, our method functions at test time and is training-free. Specifically, SA-GS can be applied to any pretrained Gaussian splatting field as a plugin to significantly improve the field's anti-aliasing performance. The core technique is to apply 2D scale-adaptive filters to each Gaussian during test time. As pointed out by Mip-Splatting, observing Gaussians at different frequencies leads to mismatches between the Gaussian scales during training and testing. Mip-Splatting resolves this issue using 3D smoothing and 2D Mip filters, which are unfortunately not aware of testing frequency. In this work, we show that a 2D scale-adaptive filter that is informed of testing frequency can effectively match the Gaussian scale, thus making the Gaussian primitive distribution remain consistent across different testing frequencies. When scale inconsistency is eliminated, sampling rates smaller than the scene frequency result in conventional jaggedness, and we propose to integrate the projected 2D Gaussian within each pixel during testing. This integration is actually a limiting case of super-sampling, which significantly improves anti-aliasing performance over vanilla Gaussian Splatting. Through extensive experiments using various settings and both bounded and unbounded scenes, we show SA-GS performs comparably with or better than Mip-Splatting. Note that super-sampling and integration are only effective when our scale-adaptive filtering is activated. Our code, data and models are available at https://github.com/zsy1987/SA-GS.

* Project page: https://kevinsong729.github.io/project-pages/SA-GS/ Code: https://github.com/zsy1987/SA-GS
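
The abstract's remark that integrating a Gaussian within each pixel is "a limiting case of super-sampling" can be illustrated with a minimal 1D sketch. This is an analogy under stated assumptions, not the SA-GS implementation: the function names, the 1D setting, and the pixel footprint `[a, b]` are all hypothetical choices made for illustration.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a 1D Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def pixel_integral(a: float, b: float, mu: float, sigma: float) -> float:
    """Closed-form integral of the Gaussian over the pixel footprint [a, b]."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def super_sample(a: float, b: float, mu: float, sigma: float, n: int) -> float:
    """Approximate the same integral with n evenly spaced samples (midpoint rule)."""
    h = (b - a) / n
    return h * sum(gaussian_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(n))
```

With a single sample per pixel (`n = 1`) the estimate depends only on the value at the pixel center, whereas increasing `n` drives `super_sample` toward `pixel_integral`, which is the sense in which integration is the limit of super-sampling.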

AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style Transfer and Multi-Track Function Prior

Oct 25, 2023

Jingwei Zhao, Gus Xia, Ye Wang

Abstract: We propose AccoMontage-3, a symbolic music automation system capable of generating multi-track, full-band accompaniment based on the input of a lead melody with chords (i.e., a lead sheet). The system contains three modular components, each modelling a vital aspect of full-band composition. The first component is a piano arranger that generates piano accompaniment for the lead sheet by transferring texture styles to the chords using latent chord-texture disentanglement and heuristic retrieval of texture donors. The second component orchestrates the piano accompaniment score into full-band arrangement according to the orchestration style encoded by individual track functions. The third component, which connects the previous two, is a prior model characterizing the global structure of orchestration style over the whole piece of music. From end to end, the system learns to generate full-band accompaniment in a self-supervised fashion, applying style transfer at two levels of polyphonic composition: texture and orchestration. Experiments show that our system outperforms the baselines significantly, and the modular design offers effective controls in a musically meaningful way.

Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls

Jul 19, 2023

Lejun Min, Junyan Jiang, Gus Xia, Jingwei Zhao

Abstract: We propose Polyffusion, a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. The model is capable of controllable music generation with two paradigms: internal control and external control. Internal control refers to the process in which users pre-define a part of the music and then let the model infill the rest, similar to the task of masked music generation (or music inpainting). External control conditions the model with external yet related information, such as chord, texture, or other features, via the cross-attention mechanism. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, arbitrary music segment inpainting, and music arrangement given chords or textures. Experimental results show that our model significantly outperforms existing Transformer and sampling-based baselines, and using pre-trained disentangled representations as external conditions yields more effective controls.

* In Proceedings of the 24th Conference of the International Society for Music Information Retrieval (ISMIR 2023), Milan, Italy
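
To make the "image-like piano roll" and the internal-control (inpainting) setup concrete, here is a minimal sketch. It assumes a plain binary pitch-by-time grid and a time-step mask; the `Note` tuple layout, the function names, and the binary encoding are illustrative assumptions, and the representation used in the paper may well differ (for example, separate onset and sustain channels).

```python
from typing import List, Tuple

Note = Tuple[int, int, int]  # hypothetical layout: (MIDI pitch, onset step, duration in steps)


def to_piano_roll(notes: List[Note], n_steps: int, n_pitches: int = 128) -> List[List[int]]:
    """Render notes into a binary n_pitches x n_steps grid (1 = note sounding)."""
    roll = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, onset, duration in notes:
        for t in range(onset, min(onset + duration, n_steps)):
            roll[pitch][t] = 1
    return roll


def mask_segment(roll: List[List[int]], start: int, end: int):
    """Blank out time steps [start, end) and return (known, mask).

    `known` is what the user pre-defines; `mask` marks the steps a model
    would be asked to infill (1 = infill, 0 = keep).
    """
    n_steps = len(roll[0])
    mask = [1 if start <= t < end else 0 for t in range(n_steps)]
    known = [[0 if mask[t] else v for t, v in enumerate(row)] for row in roll]
    return known, mask
```

For instance, `mask_segment(to_piano_roll([(60, 0, 4), (64, 4, 4)], 8), 4, 8)` keeps the first four steps as the user-defined part and marks the last four for infilling.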

Q&A: Query-Based Representation Learning for Multi-Track Symbolic Music re-Arrangement

Jun 02, 2023

Jingwei Zhao, Gus Xia, Ye Wang

Abstract: Music rearrangement is a common music practice of reconstructing and reconceptualizing a piece using new composition or instrumentation styles, which is also an important task of automatic music generation. Existing studies typically model the mapping from a source piece to a target piece via supervised learning. In this paper, we tackle rearrangement problems via self-supervised learning, in which the mapping styles can be regarded as conditions and controlled in a flexible way. Specifically, we are inspired by the representation disentanglement idea and propose Q&A, a query-based algorithm for multi-track music rearrangement under an encoder-decoder framework. Q&A learns both a content representation from the mixture and function (style) representations from each individual track, while the latter queries the former in order to rearrange a new piece. Our current model focuses on popular music and provides a controllable pathway to four scenarios: 1) re-instrumentation, 2) piano cover generation, 3) orchestration, and 4) voice separation. Experiments show that our query system achieves high-quality rearrangement results with delicate multi-track structures, significantly outperforming the baselines.

* Accepted by IJCAI 2023 Special Track on AI, the Arts and Creativity

Domain Adversarial Training on Conditional Variational Auto-Encoder for Controllable Music Generation

Sep 15, 2022

Jingwei Zhao, Gus Xia, Ye Wang

Abstract: The variational auto-encoder has become a leading framework for symbolic music generation, and a popular research direction is to study how to effectively control the generation process. A straightforward way is to control a model using different conditions during inference. However, in music practice, conditions are usually sequential (rather than simple categorical labels), involving rich information that overlaps with the learned representation. Consequently, the decoder gets confused about whether to "listen to" the latent representation or the condition, and sometimes just ignores the condition. To solve this problem, we leverage domain adversarial training to disentangle the representation from condition cues for better control. Specifically, we propose a condition corruption objective that uses the representation to denoise a corrupted condition. Minimized by a discriminator and maximized by the VAE encoder, this objective adversarially induces a condition-invariant representation. In this paper, we focus on the task of melody harmonization to illustrate our idea, while our methodology can be generalized to other controllable generative tasks. Demos and experiments show that our methodology facilitates not only condition-invariant representation learning but also higher-quality controllability compared to baselines.

* Accepted by ISMIR 2022

Beat Transformer: Demixed Beat and Downbeat Tracking with Dilated Self-Attention

Sep 15, 2022

Jingwei Zhao, Gus Xia, Ye Wang

Abstract: We propose Beat Transformer, a novel Transformer encoder architecture for joint beat and downbeat tracking. Different from previous models that track beats solely based on the spectrogram of an audio mixture, our model deals with demixed spectrograms with multiple instrument channels. This is inspired by the fact that humans perceive metrical structures from richer musical contexts, such as chord progression and instrumentation. To this end, we develop a Transformer model with both time-wise attention and instrument-wise attention to capture deep-buried metrical cues. Moreover, our model adopts a novel dilated self-attention mechanism, which achieves powerful hierarchical modelling with only linear complexity. Experiments demonstrate a significant improvement in demixed beat tracking over the non-demixed version. Also, Beat Transformer achieves an improvement of up to 4 percentage points in downbeat tracking accuracy over the TCN architectures. We further discover an interpretable attention pattern that mirrors our understanding of hierarchical metrical structures.

* Accepted by ISMIR 2022
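
The claim that dilated self-attention reaches "only linear complexity" can be seen from the attention pattern alone. The sketch below is an assumption-laden illustration rather than the authors' code: it supposes each query attends to a fixed number of keys (`kernel`) spaced `dilation` frames apart and centered on the query, and the function names are hypothetical.

```python
from typing import List


def dilated_neighbors(t: int, length: int, dilation: int, kernel: int) -> List[int]:
    """Positions that query t attends to: `kernel` slots spaced `dilation` apart, centered on t."""
    half = kernel // 2
    idx = [t + (j - half) * dilation for j in range(kernel)]
    # Drop slots that fall outside the sequence.
    return [i for i in idx if 0 <= i < length]


def attention_pairs(length: int, dilation: int, kernel: int) -> int:
    """Total query-key pairs in one layer; never more than kernel * length."""
    return sum(len(dilated_neighbors(t, length, dilation, kernel)) for t in range(length))
```

Because every query looks at no more than `kernel` keys, the pair count grows linearly with sequence length instead of quadratically. Stacking layers with increasing dilation (for example 1, 2, 4, which is an assumption here) would widen the receptive field layer by layer, which is one way to read the "hierarchical modelling" in the abstract.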

AccoMontage: Accompaniment Arrangement via Phrase Selection and Style Transfer

Aug 25, 2021

Jingwei Zhao, Gus Xia

Abstract: Accompaniment arrangement is a difficult music generation task involving intertwined constraints of melody, harmony, texture, and music structure. Existing models are not yet able to capture all these constraints effectively, especially for long-term music generation. To address this problem, we propose AccoMontage, an accompaniment arrangement system for whole pieces of music through unifying phrase selection and neural style transfer. We focus on generating piano accompaniments for folk/pop songs based on a lead sheet (i.e., melody with chord progression). Specifically, AccoMontage first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure learning-based approaches, AccoMontage introduces a novel hybrid pathway, in which rule-based optimization and deep learning are both leveraged to complement each other for high-quality generation. Experiments show that our model generates well-structured accompaniment with delicate texture, significantly outperforming the baselines.

* Accepted by ISMIR 2021
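
The phrase-selection step described in the abstract (retrieving candidates and recombining them with dynamic programming) follows the general shape of a Viterbi-style search. The sketch below shows that generic technique under stated assumptions: the fitness scores `fit`, the transition scores `trans`, and the function name are hypothetical, and the scoring functions actually used by AccoMontage are not given in the abstract.

```python
from typing import List


def select_phrases(fit: List[List[float]], trans: List[List[List[float]]]) -> List[int]:
    """Pick one candidate per phrase, maximizing total fitness plus transition scores.

    fit[i][k]      : how well candidate k fits phrase i (hypothetical score)
    trans[i][j][k] : smoothness of moving from candidate j at phrase i
                     to candidate k at phrase i + 1 (hypothetical score)
    """
    n = len(fit)
    best = [list(fit[0])]        # best[i][k]: best score of any path ending in candidate k
    back: List[List[int]] = []   # back[i - 1][k]: predecessor of k on that best path
    for i in range(1, n):
        row, ptr = [], []
        for k in range(len(fit[i])):
            scores = [best[i - 1][j] + trans[i - 1][j][k] for j in range(len(fit[i - 1]))]
            j_star = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[j_star] + fit[i][k])
            ptr.append(j_star)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final candidate.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]
```

The transition term is what lets the search trade off per-phrase fit against coherence between neighbouring phrases; with all transitions set to zero the result reduces to picking the best-fitting candidate for each phrase independently.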

Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise

Dec 10, 2020

Pengfei Chen, Junjie Ye, Guangyong Chen, Jingwei Zhao, Pheng-Ann Heng

Abstract: Supervised learning under label noise has seen numerous advances recently, while existing theoretical findings and empirical results broadly build on the class-conditional noise (CCN) assumption that the noise is independent of input features given the true label. In this work, we present a theoretical hypothesis test and prove that noise in real-world datasets is unlikely to be CCN, which confirms that label noise should depend on the instance and justifies the urgent need to go beyond the CCN assumption. The theoretical results motivate us to study the more general and practically relevant instance-dependent noise (IDN). To stimulate the development of theory and methodology on IDN, we formalize an algorithm to generate controllable IDN and present both theoretical and empirical evidence to show that IDN is semantically meaningful and challenging. As a primary attempt to combat IDN, we present a tiny algorithm termed self-evolution average label (SEAL), which not only stands out under IDN with various noise fractions, but also improves the generalization on the real-world noise benchmark Clothing1M. Our code is released. Notably, our theoretical analysis in Section 2 provides rigorous motivations for studying IDN, which is an important topic that deserves more research attention in the future.

* Accepted by AAAI 2021
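
The distinction between class-conditional and instance-dependent noise can be sketched as two label-corruption routines. This is only an illustration of the definitions in the abstract, not the paper's IDN generation algorithm: the transition matrix `T`, the per-sample `flip_prob` list, and both function names are hypothetical.

```python
import random
from typing import List, Sequence


def ccn_labels(y: Sequence[int], T: Sequence[Sequence[float]], rng: random.Random) -> List[int]:
    """Class-conditional noise: the noisy label is drawn from row T[y_i],
    so the corruption depends only on the true class, never on the input."""
    classes = range(len(T))
    return [rng.choices(classes, weights=T[label])[0] for label in y]


def idn_labels(y: Sequence[int], flip_prob: Sequence[float], n_classes: int,
               rng: random.Random) -> List[int]:
    """Instance-dependent noise: sample i carries its own flip probability
    flip_prob[i], which in practice would be derived from its features."""
    noisy = []
    for label, p in zip(y, flip_prob):
        if rng.random() < p:
            others = [c for c in range(n_classes) if c != label]
            noisy.append(rng.choice(others))
        else:
            noisy.append(label)
    return noisy
```

Under CCN, two samples of the same class are always corrupted with the same probabilities, while under IDN a hard or ambiguous sample can be given a much higher flip probability than an easy one of the same class.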
