GaMaDHaNi: Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

4.4 Human Evaluation on Melodic Quality

Examples of audio samples used in the listening test

Presented below are samples from the various systems used in the listening study. The outputs of all but one of the systems were passed through our Spectrogram Generator to keep audio quality comparable. The Hierarchical Encodec baseline generates audio directly and therefore did not need the Spectrogram Generator.

GaMaDHaNi (autoregressive variant)

GaMaDHaNi (diffusion variant)

Non-hierarchical baseline

Hierarchical Encodec baseline

Ground truth (resynthesized with Spectrogram Generator + Griffin-Lim)

Diversity in Generation

Hierarchical Encodec baseline

This model has a tendency to hold the same note.


Our proposed methods are able to generate both slow and fast movements, resulting in more variety.

Consistency of vocal timbre

Note: all audio samples are generated unconditionally (as defined in Section 4.4), except the hierarchical baseline’s samples in this section, which are generated in the primed generation setting (defined in Section 5.1).

Non-hierarchical baseline

The non-hierarchical baseline sometimes changes the timbre of the voice.

Hierarchical baseline

The hierarchical baseline changes voice in the primed generation setting, where 4 s of input audio is fed to the model and the model continues the sequence (see Section 5.1 of the paper).


Our proposed methods are able to maintain vocal timbre.