Total Generation
Audio Conditioning Generation
Text Conditioning Generation
Arrangement Generation
Arrangement Generation B: Bass
Arrangement Generation D: Drums
Arrangement Generation G: Guitar
Arrangement Generation P: Piano
Arrangement Generation BD: Bass and Drums
Arrangement Generation BG: Bass and Guitar
Arrangement Generation BP: Bass and Piano
Arrangement Generation DG: Drums and Guitar
Arrangement Generation DP: Drums and Piano
Arrangement Generation GP: Guitar and Piano
Arrangement Generation BDG: Bass, Drums and Guitar
Arrangement Generation BDP: Bass, Drums and Piano
Arrangement Generation BGP: Bass, Guitar and Piano
Arrangement Generation DGP: Drums, Guitar and Piano
P.S. Source Separation
Separations with the Dirac algorithm
Separations with the Gaussian algorithm
Source Separation
Audio source separation is the process of isolating individual sound elements from a mixture. It is critical in numerous areas, particularly music production, where it enables single instruments to be extracted from a complete mix. The primary challenge is to identify and extract the intended source accurately without introducing artifacts or degrading the audio quality.
MSDM [1] introduced a diffusion-based multi-source generative model capable of both music synthesis and source separation within a single framework, operating directly on raw waveforms. The model employs a diffusion-based generative approach, trained via denoising score matching [2], to learn the priors of stems that share contextual relationships. The fundamental principle of score matching is to approximate the "score" of the target distribution \(p(x)\), namely \(\nabla_x \log p(x)\), rather than the distribution itself. The denoising process in such models follows a probability flow ordinary differential equation (ODE) [3]: \begin{equation} \mathrm{d}x(n) = \sigma(n) \nabla_{x(n)} \log p(x(n) | \sigma(n)) \, \mathrm{d}n, \label{denoising_score} \end{equation} where \(n\) denotes the diffusion step and \(\sigma(n)\) controls the noise intensity at each step.
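To make Eq. (1) concrete, the sketch below performs a plain Euler discretization of the probability-flow ODE, stepping a noisy sample backwards along a decreasing schedule of diffusion steps. The `score_fn`, `steps`, and `sigma` names are placeholders standing in for a trained score model and its noise schedule; this illustrates the update rule only and is not the MT-MusicLDM sampler.

```python
def euler_probability_flow(score_fn, x_init, steps, sigma):
    """Illustrative Euler integration of the probability-flow ODE in Eq. (1).

    score_fn(x, sigma_n) -> estimate of grad_x log p(x | sigma_n)   (assumed interface)
    steps                -> decreasing sequence of diffusion steps n (assumed)
    sigma                -> callable mapping a step n to its noise level sigma(n) (assumed)
    """
    x = x_init
    for i in range(len(steps) - 1):
        n, n_next = steps[i], steps[i + 1]
        dn = n_next - n                              # negative: integrating backwards in n
        x = x + sigma(n) * score_fn(x, sigma(n)) * dn
    return x
```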
The MSDM framework casts source separation as a particular case of conditional generation, in which the model must estimate the score of the posterior distribution given the mixture, \(\nabla_{x(n)} \log p(x(n)|x_{\textit{mixture}})\). The authors argue that in diffusion-based generative approaches to source separation, learning an explicit likelihood model is redundant, because the relationship between the clean sources \(x\) and the mixture \(x_{\textit{mixture}}\) is fully characterized by a simple summation. MSDM details two methods for approximating the posterior score, "MSDM Dirac" and "MSDM Gaussian"; their detailed discussion is omitted here for brevity. In our study, we adopt both separation methods from MSDM, following the formulation of Karras et al. [4], but replace the score-matching model with our DDIM denoising model. This replacement rests on the equivalence shown by Song et al. [5], who observe that their deterministic DDIM sampler can be expressed as the Euler integration of the ODE: \begin{equation} \mathrm{d}x(n) = \epsilon_\theta \left( \frac{x(n)}{\sqrt{\sigma(n)^2+1}} \right) \mathrm{d}\sigma(n), \label{denoising_score_to_DDIM} \end{equation} where \(\epsilon_\theta\) denotes the DDPM model trained to predict the normalized noise vector. Moreover, we adapted both denoising algorithms presented in MSDM to our DDIM paradigm by translating them into the \(z\) latent space and incorporating the necessary modifications and 3D integration.
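As a companion sketch, the update below integrates Eq. (2) with Euler steps over the noise schedule, i.e. the deterministic DDIM sampler in the form used above. Here `eps_model` stands for a trained noise predictor \(\epsilon_\theta\) and `sigmas` for a decreasing noise schedule; both names are illustrative assumptions, not the actual MT-MusicLDM interface. For separation, the per-source updates would additionally be steered by the Dirac or Gaussian posterior-score approximations from MSDM, which are omitted here as in the text.

```python
def ddim_euler_sample(eps_model, x_init, sigmas):
    """Illustrative Euler integration of Eq. (2): dx = eps_theta(x / sqrt(sigma^2 + 1)) dsigma.

    eps_model(x_scaled, sigma) -> predicted normalized noise epsilon_theta (assumed interface)
    sigmas                     -> decreasing schedule from sigma_max down to ~0 (assumed)
    """
    x = x_init
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d_sigma = sigma_next - sigma                         # negative step
        eps = eps_model(x / (sigma ** 2 + 1) ** 0.5, sigma)  # eps_theta(x / sqrt(sigma^2 + 1))
        x = x + eps * d_sigma
    return x
```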
| Algorithm | Bass | Drums | Guitar | Piano |
|---|---|---|---|---|
| MT-MusicLDM (Dirac) | 3.26 | 3.63 | 2.79 | 2.58 |
| MT-MusicLDM (Gaussian) | 3.23 | 3.07 | 1.97 | 2.53 |
| MSDM (Dirac) | 17.12 | 18.68 | 15.38 | 14.73 |
| MSDM (Gaussian) | 13.93 | 17.92 | 14.19 | 12.11 |
References
- Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà: Multi-Source Diffusion Models for Simultaneous Music Generation and Separation, arXiv:2302.02257, 2024. ↩
- Yang Song and Stefano Ermon: Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS, 2019. ↩
- Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole: Score-Based Generative Modeling through Stochastic Differential Equations, ICLR, 2021. ↩
- Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine: Elucidating the Design Space of Diffusion-Based Generative Models, arXiv:2206.00364, 2022. ↩
- Jiaming Song, Chenlin Meng, and Stefano Ermon: Denoising Diffusion Implicit Models, ICLR, 2021. ↩