4.1 Dataset

Artifacts in the dataset

Our dataset consists of vocal audio separated from mixed recordings of Hindustani vocal music performances containing voice, melodic accompaniment (sarangi or harmonium), and rhythmic accompaniment (tabla), as described in Section 4.1. In this section we note some instances of incorrect data and the incorrect generations that result from them.

1. Leaked Sarangi (stringed melodic instrument) sound

The source separation model, Demucs, finds it particularly hard to separate the sarangi sound from the voice. This could be due to a combination of our data being out of distribution for Demucs and the similarity of the sarangi’s timbre to that of the voice. The source-separated audio shown below is the vocal stem that is fed into our model for training.
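For readers who want to reproduce this kind of vocal stem, the following is a minimal sketch of how a Demucs command line could be assembled. It is not necessarily the configuration we used: the model name `htdemucs`, the output directory `separated`, the file name `concert_01.mp3`, and both helper functions are assumptions made for illustration. Only the `demucs` CLI flags `--two-stems`, `-n`, and `-o` come from the Demucs tool itself.

```python
from pathlib import Path


def demucs_vocals_command(mix_path, out_dir="separated", model="htdemucs"):
    """Build a Demucs CLI call that splits a mix into vocals / no_vocals.

    The model name and output directory are illustrative defaults,
    not settings confirmed by this document.
    """
    return [
        "demucs",
        "--two-stems", "vocals",  # keep only the vocals vs. everything else
        "-n", model,              # pretrained model to use (assumed here)
        "-o", str(out_dir),       # root folder for the separated stems
        str(mix_path),
    ]


def expected_vocal_stem(mix_path, out_dir="separated", model="htdemucs"):
    """Path where Demucs writes the vocal stem under its default layout:
    <out_dir>/<model>/<track name without extension>/vocals.wav
    """
    return Path(out_dir) / model / Path(mix_path).stem / "vocals.wav"
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where Demucs is installed. Any sarangi that leaks through separation ends up in the resulting `vocals.wav`, which is the kind of artifact illustrated in the examples that follow.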

Examples of Sarangi in the dataset

Example 1: Only Sarangi playing with tabla

Original Audio

Source Separated Audio

Example 2: Sarangi playing with voice and tabla

Original Audio

Source Separated Audio

Examples of Sarangi in generated samples

Example 1: generated by GaMaDHaNi (diffusion variant); sarangi-like sound between 0 and 4 s

Example 2: generated by non-hierarchical baseline; sarangi-like sound between 1 and 2 s

2. Speech

The dataset also contains some instances of speech. As a result, some generations sound like a hybrid of speech and singing.

Example of speech in the dataset

Original Audio

Source Separated Audio

Example of speech-like sounds in generation

This example seems to be a mix between speech and singing.

Generated sample by GaMaDHaNi (diffusion variant)