Our dataset consists of vocal audio separated from mixed recordings of Hindustani vocal music performances containing voice, melodic accompaniment (sarangi or harmonium), and rhythmic accompaniment (tabla), as described in Section 4.1. In this section we note some instances of incorrect data and the incorrect generations that result from them.
The source separation model, Demucs, finds it particularly hard to separate the sarangi from the voice. This could be due to a combination of our data being out of distribution for Demucs and the similarity of the sarangi's timbre to that of the voice. The source-separated audio shown below is the vocal stem that is fed into our model for training.
Original Audio
Source Separated Audio
Original Audio
Source Separated Audio
Example 1: generated by GaMaDHaNi (diffusion variant), between 0 and 4 s
Example 2: generated by the non-hierarchical baseline, between 1 and 2 s
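For readers who want to reproduce this kind of vocal stem, the following is a minimal sketch of how a two-stem Demucs separation might be invoked from Python through the Demucs command-line tool. The exact Demucs model and options used for our dataset are not stated here, so the flags below are assumptions, and the helper names `build_demucs_command` and `separate_vocals` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_demucs_command(track: Path, out_dir: Path) -> list[str]:
    """Build a Demucs CLI call that splits a track into vocals and the rest.

    Assumption: default Demucs model; the settings used for the dataset
    discussed on this page may differ.
    """
    return [
        "demucs",
        "--two-stems", "vocals",  # produce only a vocal stem and its complement
        "-o", str(out_dir),       # folder where Demucs writes the separated stems
        str(track),
    ]


def separate_vocals(track: Path, out_dir: Path) -> None:
    """Run the separation; requires the `demucs` executable on PATH."""
    if shutil.which("demucs") is None:
        raise RuntimeError("demucs executable not found on PATH")
    # check=True raises CalledProcessError if Demucs exits with an error.
    subprocess.run(build_demucs_command(track, out_dir), check=True)
```

Calling `separate_vocals(Path("performance.wav"), Path("separated"))` would write the stems under the `separated` folder. As the examples in this section show, the resulting vocal stem can still contain sarangi, so listening checks remain useful.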
There are also some instances of speech in the dataset. As a result, some generated samples sound like a hybrid of speech and singing.
Original Audio
Source Separated Audio
This example seems to be a mix of speech and singing.
Sample generated by GaMaDHaNi (diffusion variant). It sounds like an interesting mixture of speech and singing.