GaMaDHaNi: Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

4.1 Dataset

Artifacts in the dataset

The vocal audio in our dataset was separated from mixed recordings of Hindustani vocal music containing voice, melodic accompaniment (sarangi or harmonium), and rhythmic accompaniment (tabla), as described in Section 4.1. In this section we note some instances of incorrect data and the incorrect generations that result from them.

1. Leaked Sarangi (stringed melodic instrument) sound

The source separation model, Demucs, finds it particularly hard to separate the sarangi from the voice. This could be due to a combination of our data being out of distribution for Demucs and the similarity of the sarangi’s timbre to that of the voice. The source-separated audio shown below is the vocal stem that is fed into our model for training.
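For readers who want to reproduce this kind of vocal stem, the following is a minimal sketch of how a Demucs vocal-only separation is commonly invoked from Python. It is not the authors' pipeline: the model name `htdemucs`, the output directory `separated`, the file name `performance.wav`, and the helper names `demucs_vocals_command` and `expected_vocal_stem` are all assumptions made for illustration, and the source does not say which Demucs variant or settings were used.

```python
from pathlib import Path


def demucs_vocals_command(track, out_dir="separated", model="htdemucs"):
    """Build the Demucs CLI call that splits a track into vocals and the rest.

    Hypothetical helper: the model name and output directory are assumptions,
    not settings reported in the source.
    """
    return [
        "demucs",
        "--two-stems", "vocals",  # produce only vocals.wav and no_vocals.wav
        "-n", model,              # pretrained model to use (assumed here)
        "-o", str(out_dir),       # root folder for the separated stems
        str(track),
    ]


def expected_vocal_stem(track, out_dir="separated", model="htdemucs"):
    """Path where Demucs writes the vocal stem by default for the given track."""
    return Path(out_dir) / model / Path(track).stem / "vocals.wav"


if __name__ == "__main__":
    # "performance.wav" is a placeholder file name.
    cmd = demucs_vocals_command("performance.wav")
    print(" ".join(cmd))
    print(expected_vocal_stem("performance.wav").as_posix())
    # To actually run the separation (requires the demucs package):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Listening to the resulting `vocals.wav` is the simplest way to spot the leakage described here, since any sarangi that survives in that file becomes part of the training data.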

Examples of Sarangi in the dataset

Example 1: Only Sarangi playing with tabla

Example 2: Sarangi playing with voice and tabla

Examples of Sarangi in generated samples

2. Speech

There are also some instances of speech in the dataset. As a result, some generations sound like a hybrid of speech and singing.

Example of speech in the dataset

Example of speech-like sounds in generation

This example seems to be a mix of speech and singing.