Examples of audio samples used in the listening test
Presented below are samples from the various systems used in the listening study. The outputs of all but one of the systems were passed through our Spectrogram Generator to maintain similar quality. The Hierarchical Encodec baseline generates audio directly and thus did not need the Spectrogram Generator.
GaMaDHaNi (autoregressive variant)
GaMaDHaNi (diffusion variant)
Non-hierarchical baseline
Hierarchical Encodec baseline
Ground truth (resynthesized with Spectrogram Generator + Griffin-Lim)
Diversity in Generation
Hierarchical Encodec baseline
This model has a tendency to hold the same note.
Example 1
Example 2
GaMaDHaNi
Our proposed methods are able to generate both slow and fast movements, resulting in more variety.
GaMaDHaNi (diffusion variant)
GaMaDHaNi (autoregressive variant)
Consistency of vocal timbre
Note: all audio samples are generated unconditionally (as defined in Section 4.4), except the hierarchical baseline’s samples in this section, which are generated in the primed generation setting (defined in Section 5.1).
Non-hierarchical baseline
The non-hierarchical baseline sometimes changes the timbre of the voice.
Example 1
Example 2
Hierarchical baseline
The hierarchical baseline changes voice in the primed generation setting (4 s of input audio is fed to the model, which then continues the sequence; see Section 5.1 of the paper).
Example 1
Example 2
GaMaDHaNi
Our proposed methods are able to maintain vocal timbre.