Multi-Track MusicLDM: Towards Universal Music Generation with Latent Diffusion Models

Welcome to the demo page for Multi-Track MusicLDM (MT-MusicLDM). Our model builds upon the foundation of latent diffusion models and introduces advancements that address the complexities of music composition, particularly in multi-track arrangements.

Our model extends the MusicLDM—a latent diffusion model for music—into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks).
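To make arrangement generation concrete, the sketch below illustrates the general idea of fixing a subset of track latents while the diffusion model fills in the rest, in the spirit of diffusion inpainting applied along the track dimension. The `scheduler` interface and function names here are hypothetical stand-ins, not the actual MT-MusicLDM API.

```python
import torch

# Hypothetical sketch: generate the missing tracks of an arrangement while
# keeping the provided ones fixed. `model` and `scheduler` are stand-ins.
def generate_arrangement(model, scheduler, given_tracks, given_mask, steps=50):
    # given_tracks: latents of shape (tracks, C, T); given_mask is 1 for tracks
    # that are provided (e.g., bass and drums) and 0 for tracks to generate.
    z = torch.randn_like(given_tracks)  # start every track from pure noise
    for t in scheduler.timesteps(steps):
        # Re-noise the known tracks to the current step and clamp them in,
        # so the model only has to fill in the masked-out tracks coherently.
        z = given_mask * scheduler.add_noise(given_tracks, t) + (1 - given_mask) * z
        z = scheduler.denoise_step(model, z, t)  # one reverse-diffusion step
    return given_mask * given_tracks + (1 - given_mask) * z
```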

On this page, we present generation demos of MT-MusicLDM in four scenarios: total generation, audio-conditioned generation, text-conditioned generation, and arrangement generation.

Additionally, at the end of this page, as a P.S., we present results for a source separation algorithm that did not perform as well as our baseline; see the detailed discussion below.

Total Generation


Audio-Conditioned Generation


Text-Conditioned Generation

"Energetic Music"
"Soft Music"

Arrangement Generation

Arrangement Generation B: Bass


Arrangement Generation D: Drums


Arrangement Generation G: Guitar


Arrangement Generation P: Piano


Arrangement Generation BD: Bass and Drums


Arrangement Generation BG: Bass and Guitar


Arrangement Generation BP: Bass and Piano


Arrangement Generation DG: Drums and Guitar


Arrangement Generation DP: Drums and Piano


Arrangement Generation GP: Guitar and Piano


Arrangement Generation BDG: Bass, Drums, and Guitar


Arrangement Generation BDP: Bass, Drums, and Piano


Arrangement Generation BGP: Bass, Guitar, and Piano


Arrangement Generation DGP: Drums, Guitar, and Piano


P.S. Source Separation

Separations with the Dirac algorithm


Separations with the Gaussian algorithm


Source Separation

Audio source separation refers to the process of isolating individual sound elements from a mixture of sounds. This method is critical in numerous areas, particularly in music production, where it facilitates the isolation of single instruments from a complete mix. The primary challenge involves accurately identifying and extracting the intended source without introducing noise or degrading the audio quality.

MSDM [1] introduced a diffusion-based multi-source generative model capable of both music synthesis and source separation within a single framework, operating directly on raw waveforms. The model is trained via denoising score matching [2] to learn the priors of stems that share contextual relationships. The fundamental principle of score matching is to approximate the "score" function of the target distribution \(p(x)\), specifically \(\nabla_x \log p(x)\), rather than the distribution itself. The denoising process in such models follows a probability flow ordinary differential equation (ODE) [3]: \begin{equation} \mathrm{d}x(n) = -\dot{\sigma}(n)\,\sigma(n)\, \nabla_{x(n)} \log p(x(n) \mid \sigma(n)) \, \mathrm{d}n, \label{denoising_score} \end{equation} where \(n\) denotes the diffusion step and \(\sigma(n)\) controls the noise intensity at each step.
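To illustrate the sampling process this ODE defines, here is a minimal Euler discretization. The score network `score(x, sigma)`, approximating \(\nabla_x \log p(x \mid \sigma)\), and the decreasing noise schedule `sigmas` are assumptions for illustration, not MSDM's actual implementation.

```python
import torch

def probability_flow_sample(score, sigmas, shape):
    # Integrate dx = -sigma_dot * sigma * score(x, sigma) dn with Euler steps,
    # walking the noise level from sigmas[0] (high) down to sigmas[-1] (~0).
    x = sigmas[0] * torch.randn(shape)          # sample from the high-noise prior
    for i in range(len(sigmas) - 1):
        d_sigma = sigmas[i + 1] - sigmas[i]     # sigma_dot(n) * dn for one step (< 0)
        x = x - d_sigma * sigmas[i] * score(x, sigmas[i])  # Euler step of the ODE
    return x
```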

The MSDM framework defines source separation as a special case of conditional generation, wherein the model must estimate the score of the posterior distribution given the mixture, \(\nabla_{x(n)} \log p(x(n) \mid x_{\textit{mixture}})\). The authors argue that in diffusion-based generative approaches to source separation, learning an explicit likelihood model is redundant, because the relationship between the clean sources \(x\) and the mixture \(x_{\textit{mixture}}\) is fully characterized by a simple summation. MSDM details two methods for approximating the posterior score function, "MSDM Dirac" and "MSDM Gaussian", whose detailed discussion we omit here for brevity. In our study, we adopted both methods used by MSDM for source separation, which builds on the framework of Karras et al. [4], replacing the score-matching model with a DDIM denoising model. This replacement is grounded in the equivalence shown by Song et al. [5], who demonstrate that their deterministic DDIM sampler can be expressed as the Euler integration of the ODE: \begin{equation} \mathrm{d}x(n) = \epsilon_\theta \left( \frac{x(n)}{\sqrt{\sigma(n)^2+1}} \right) \mathrm{d}\sigma(n), \label{denoising_score_to_DDIM} \end{equation} where \(\epsilon_\theta\) is the DDPM model trained to predict the normalized noise vector. Moreover, we adapted both denoising algorithms presented in MSDM to our DDIM paradigm by translating them into the latent space \(z\), incorporating the necessary modifications for the 3D structure of our latents.
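For intuition, the sketch below shows the "Dirac"-style posterior score approximation: the last stem is constrained to equal the mixture minus the others, and the chain rule maps the joint score back to the remaining free stems. The `score` model and tensor shapes are illustrative assumptions; our actual adaptation operates on latents, where this linear constraint no longer holds exactly (see the discussion below).

```python
import torch

def dirac_posterior_score(score, x, mixture, sigma):
    # x: noisy stem estimates of shape (N, ...); mixture: the observed mix (...).
    x = x.clone()
    x[-1] = mixture - x[:-1].sum(dim=0)   # Dirac constraint: stems sum to the mixture
    s = score(x, sigma)                   # joint score over all N stems
    # Chain rule through x_N = mixture - sum(x_1..x_{N-1}): each free stem's
    # posterior score is its own score minus the constrained stem's score.
    return s[:-1] - s[-1]
```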

Table 1: SI-SDRi (dB) for source separation on the Slakh2100 test set

Algorithm                 Bass    Drums   Guitar   Piano
MT-MusicLDM (Dirac)        3.26    3.63    2.79     2.58
MT-MusicLDM (Gaussian)     3.23    3.07    1.97     2.53
MSDM (Dirac)              17.12   18.68   15.38    14.73
MSDM (Gaussian)           13.93   17.92   14.19    12.11
Following the achievements of MSDM, we evaluated our model on the source separation task using the Slakh2100 test set. Although our model demonstrates some nonzero separation capability, as shown in Table 1, it significantly underperforms the benchmarks set by MSDM. Using the posterior score approximations from MSDM, namely "Dirac" and "Gaussian," conditioned on the mixture, our model generated stems that perceptually resemble the original tracks. However, there is noticeable bleeding among the tracks and occasional generation of audio content that deviates from the original compositions; this is particularly pronounced for the guitar and piano tracks, as reflected by their lower scores in Table 1. We attribute this limitation to our model operating in a latent space rather than directly on waveforms: in the latent space, the relationship between the mixture and the separated tracks is no longer the simple linear summation that the "Dirac" and "Gaussian" algorithms rely on. We are actively exploring separation methods that accommodate this nonlinearity and expect to address it in future publications.
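For reference, the SI-SDRi numbers in Table 1 follow the standard scale-invariant SDR improvement definition, sketched below; this is the textbook formulation, not our exact evaluation code.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    # Scale-invariant SDR in dB between two 1-D waveforms.
    alpha = (estimate * reference).sum() / (reference.pow(2).sum() + eps)
    target = alpha * reference                   # optimally scaled reference
    noise = estimate - target                    # residual error
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def si_sdri(estimate, reference, mixture):
    # Improvement over simply using the mixture itself as the estimate.
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```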

References

  1. Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà: Multi-Source Diffusion Models for Simultaneous Music Generation and Separation, arXiv:2302.02257, 2024.
  2. Yang Song and Stefano Ermon: Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS, 2019.
  3. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole: Score-Based Generative Modeling through Stochastic Differential Equations, ICLR, 2021.
  4. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine: Elucidating the Design Space of Diffusion-Based Generative Models, arXiv:2206.00364, 2022.
  5. Jiaming Song, Chenlin Meng, and Stefano Ermon: Denoising Diffusion Implicit Models, ICLR, 2021.