Multi-Track MusicLDM: Towards Universal Music Generation with Latent Diffusion Models

Welcome to the demo page for Multi-Track MusicLDM (MT-MusicLDM). Our model builds upon the foundation of latent diffusion models and introduces advancements that address the complexities of music composition, particularly in multi-track arrangements.

Our model extends the MusicLDM—a latent diffusion model for music—into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks).
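To make arrangement generation concrete, the sketch below illustrates the general idea of fixing a subset of track latents while the diffusion model fills in the rest, in the spirit of diffusion inpainting applied along the track dimension. The `scheduler` interface and function names here are hypothetical stand-ins, not the actual MT-MusicLDM API.

```python
import torch

# Hypothetical sketch: generate the missing tracks of an arrangement while
# keeping the provided ones fixed. `model` and `scheduler` are stand-ins.
def generate_arrangement(model, scheduler, given_tracks, given_mask, steps=50):
    # given_tracks: latents of shape (tracks, C, T); given_mask is 1 for tracks
    # that are provided (e.g., bass and drums) and 0 for tracks to generate.
    z = torch.randn_like(given_tracks)  # start every track from pure noise
    for t in scheduler.timesteps(steps):
        # Re-noise the known tracks to the current step and clamp them in,
        # so the model only has to fill in the masked-out tracks coherently.
        z = given_mask * scheduler.add_noise(given_tracks, t) + (1 - given_mask) * z
        z = scheduler.denoise_step(model, z, t)  # one reverse-diffusion step
    return given_mask * given_tracks + (1 - given_mask) * z
```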

On this page, we present generation demos of MT-MusicLDM in four scenarios: total generation, audio-conditioned generation, text-conditioned generation, and arrangement generation.

Additionally, at the end of this page, as a P.S., we present results for a source separation algorithm that did not perform as well as our baseline; see the detailed discussion below.

Total Generation


Audio-Conditioned Generation


Text-Conditioned Generation

"Energetic Music"
"Soft Music"

Arrangement Generation

Arrangement Generation B: Bass


Arrangement Generation D: Drums


Arrangement Generation G: Guitar


Arrangement Generation P: Piano


Arrangement Generation BD: Bass and Drums


Arrangement Generation BG: Bass and Guitar


Arrangement Generation BP: Bass and Piano


Arrangement Generation DG: Drums and Guitar


Arrangement Generation DP: Drums and Piano


Arrangement Generation GP: Guitar and Piano


Arrangement Generation BDG: Bass, Drums, and Guitar


Arrangement Generation BDP: Bass, Drums, and Piano


Arrangement Generation BGP: Bass, Guitar, and Piano


Arrangement Generation DGP: Drums, Guitar, and Piano


P.S. Source Separation

Separations with the Dirac algorithm


Separations with the Gaussian algorithm


Source Separation

Audio source separation refers to the process of isolating individual sound elements from a mixture of sounds. This method is critical in numerous areas, particularly in music production, where it facilitates the isolation of single instruments from a complete mix. The primary challenge involves accurately identifying and extracting the intended source without introducing noise or degrading the audio quality.

MSDM [1] introduced a diffusion-based multi-source generative model capable of both music synthesis and source separation within a single framework, operating directly on raw waveforms. The model is trained via denoising score matching [2] to learn the priors of stems that share contextual relationships. The fundamental principle of score matching is to approximate the "score" function of the target distribution \(p(x)\), specifically \(\nabla_x \log p(x)\), rather than the distribution itself. The denoising process in such models follows a probability flow ordinary differential equation (ODE) [3]: \begin{equation} \mathrm{d}x(n) = -\dot{\sigma}(n)\,\sigma(n)\, \nabla_{x(n)} \log p(x(n) \mid \sigma(n)) \, \mathrm{d}n, \label{denoising_score} \end{equation} where \(n\) denotes the diffusion step and \(\sigma(n)\) controls the noise intensity at each step.
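To illustrate the sampling process this ODE defines, here is a minimal Euler discretization. The score network `score(x, sigma)`, approximating \(\nabla_x \log p(x \mid \sigma)\), and the decreasing noise schedule `sigmas` are assumptions for illustration, not MSDM's actual implementation.

```python
import torch

def probability_flow_sample(score, sigmas, shape):
    # Integrate dx = -sigma_dot * sigma * score(x, sigma) dn with Euler steps,
    # walking the noise level from sigmas[0] (high) down to sigmas[-1] (~0).
    x = sigmas[0] * torch.randn(shape)          # sample from the high-noise prior
    for i in range(len(sigmas) - 1):
        d_sigma = sigmas[i + 1] - sigmas[i]     # sigma_dot(n) * dn for one step (< 0)
        x = x - d_sigma * sigmas[i] * score(x, sigmas[i])  # Euler step of the ODE
    return x
```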

The MSDM framework defines source separation as a special case of conditional generation, wherein the model must estimate the score of the posterior distribution given the mixture, \(\nabla_{x(n)} \log p(x(n) \mid x_{\textit{mixture}})\). The authors argue that in diffusion-based generative approaches to source separation, learning an explicit likelihood model is redundant, because the relationship between the clean sources \(x\) and the mixture \(x_{\textit{mixture}}\) is fully characterized by a simple summation. MSDM details two methods for approximating the posterior score function, "MSDM Dirac" and "MSDM Gaussian", whose detailed discussion we omit here for brevity. In our study, we adopted both methods used by MSDM for source separation, which builds on the framework of Karras et al. [4], replacing the score-matching model with a DDIM denoising model. This replacement is grounded in the equivalence shown by Song et al. [5], who demonstrate that their deterministic DDIM sampler can be expressed as the Euler integration of the ODE: \begin{equation} \mathrm{d}x(n) = \epsilon_\theta \left( \frac{x(n)}{\sqrt{\sigma(n)^2+1}} \right) \mathrm{d}\sigma(n), \label{denoising_score_to_DDIM} \end{equation} where \(\epsilon_\theta\) is the DDPM model trained to predict the normalized noise vector. Moreover, we adapted both denoising algorithms presented in MSDM to our DDIM paradigm by translating them into the latent space \(z\), incorporating the necessary modifications for the 3D structure of our latents.
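For intuition, the sketch below shows the "Dirac"-style posterior score approximation: the last stem is constrained to equal the mixture minus the others, and the chain rule maps the joint score back to the remaining free stems. The `score` model and tensor shapes are illustrative assumptions; our actual adaptation operates on latents, where this linear constraint no longer holds exactly (see the discussion below).

```python
import torch

def dirac_posterior_score(score, x, mixture, sigma):
    # x: noisy stem estimates of shape (N, ...); mixture: the observed mix (...).
    x = x.clone()
    x[-1] = mixture - x[:-1].sum(dim=0)   # Dirac constraint: stems sum to the mixture
    s = score(x, sigma)                   # joint score over all N stems
    # Chain rule through x_N = mixture - sum(x_1..x_{N-1}): each free stem's
    # posterior score is its own score minus the constrained stem's score.
    return s[:-1] - s[-1]
```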

Table 1: SI-SDRi (dB) for source separation on the Slakh2100 test set

Algorithm                 Bass    Drums   Guitar   Piano
MT-MusicLDM (Dirac)        3.26    3.63    2.79     2.58
MT-MusicLDM (Gaussian)     3.23    3.07    1.97     2.53
MSDM (Dirac)              17.12   18.68   15.38    14.73
MSDM (Gaussian)           13.93   17.92   14.19    12.11
Following the achievements of MSDM, we evaluated our model on the source separation task using the Slakh2100 test set. Although our model demonstrates some nonzero separation capability, as shown in Table 1, it significantly underperforms the benchmarks set by MSDM. Using the posterior score approximations from MSDM, namely "Dirac" and "Gaussian," conditioned on the mixture, our model generated stems that perceptually resemble the original tracks. However, there is noticeable bleeding among the tracks and occasional generation of audio content that deviates from the original compositions; this is particularly pronounced for the guitar and piano tracks, as reflected by their lower scores in Table 1. We attribute this limitation to our model operating in a latent space rather than directly on waveforms: in the latent space, the relationship between the mixture and the separated tracks is no longer the simple linear summation that the "Dirac" and "Gaussian" algorithms rely on. We are actively exploring separation methods that accommodate this nonlinearity and expect to address it in future publications.
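For reference, the SI-SDRi numbers in Table 1 follow the standard scale-invariant SDR improvement definition, sketched below; this is the textbook formulation, not our exact evaluation code.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    # Scale-invariant SDR in dB between two 1-D waveforms.
    alpha = (estimate * reference).sum() / (reference.pow(2).sum() + eps)
    target = alpha * reference                   # optimally scaled reference
    noise = estimate - target                    # residual error
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def si_sdri(estimate, reference, mixture):
    # Improvement over simply using the mixture itself as the estimate.
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```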

References

  1. Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà: Multi-Source Diffusion Models for Simultaneous Music Generation and Separation, arXiv:2302.02257, 2024.
  2. Yang Song and Stefano Ermon: Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS, 2019.
  3. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole: Score-Based Generative Modeling through Stochastic Differential Equations, ICLR, 2021.
  4. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine: Elucidating the Design Space of Diffusion-Based Generative Models, arXiv:2206.00364, 2022.
  5. Jiaming Song, Chenlin Meng, and Stefano Ermon: Denoising Diffusion Implicit Models, ICLR, 2021.