Music research

Interactive Music Generation Experiments

Research Abstract

In Hindustani music, a tradition of North Indian classical music, the melody is loosely constrained by the chosen composition but otherwise largely improvised in accordance with the raga grammar, which is often realized at different levels of abstraction and conformity [1, 2]. Through this project I aim to develop a human-computer interactive system that enables artists and performers to explore novel creative spaces inspired by the Hindustani idiom. The study focuses on three aspects, namely music generation, controllability, and interaction, each with its own motivation as specified below.

Generation: Previous work on generative modeling for Hindustani music [3, 4, 5, 6] approaches the problem of generating raga music as MIDI-like notation. Hindustani music is a largely oral tradition, so reducing the music to MIDI-like notation discards important information such as note ornamentations, dynamics, and fine intonation. I therefore propose two other ways of generating audio in this idiom: pitch-contour-based generation and audio-waveform-based generation. Pitch-contour-based generation [7, 8] can capture intricate melodic movements better than MIDI owing to its finer resolution on the pitch axis; additionally, DDSP [9] can be used to synthesize the pitch contour into realistic sound. The audio waveform, on the other hand, captures far more detail (timbre, dynamics) than pitch alone, but its high sample rate also makes it noisier and harder to model [10]. I propose to encode the waveform with neural audio encoders [11, 12, 13] and to model the resulting encoded sequence. Given their ability to model musical sequences effectively, I use transformers [14, 15, 16] with a sequence-continuation objective: given an input sequence, the model should produce a continuation that is likely under the data distribution. Performance can be measured objectively by entropy, alongside hand-crafted metrics typical of Hindustani music such as adherence to the notes of the raga (melodic mode) and the pitch range of the input and output; a sketch of one such metric follows.
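To illustrate the hand-crafted metrics mentioned above, here is a minimal sketch (not the project's actual evaluation code) of a raga-adherence score computed over a frame-wise pitch contour; the function name, the 35-cent tolerance, and the svara encoding are assumptions made for the example.

```python
import numpy as np

def raga_adherence(f0_hz, tonic_hz, allowed_svaras, tolerance_cents=35):
    """Fraction of voiced frames whose pitch lies near an allowed svara of the raga.

    f0_hz          : frame-wise pitch estimates in Hz (0 marks unvoiced frames)
    tonic_hz       : tonic frequency the performance is anchored to
    allowed_svaras : svaras of the raga as semitones above the tonic, e.g. [0, 2, 4, 7, 9]
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    if not voiced.any():
        return 0.0
    # pitch in cents relative to the tonic, folded into a single octave
    cents = np.mod(1200.0 * np.log2(f0_hz[voiced] / tonic_hz), 1200.0)
    targets = np.asarray(allowed_svaras, dtype=float) * 100.0
    # distance in cents to the nearest allowed svara, accounting for octave wrap-around
    diffs = np.abs(cents[:, None] - targets[None, :])
    diffs = np.minimum(diffs, 1200.0 - diffs)
    return float((diffs.min(axis=1) <= tolerance_cents).mean())
```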

Controllability: The idiom of Hindustani music provides a framework, or fixed boundary, within which improvisation takes place. For instance, a performer always performs with respect to a tonic frequency; by extension, generations must be conditioned on a tonic frequency to be useful for interaction. Improvisation also usually takes place within a raga framework, which becomes another conditioning signal for generation. Beyond these global controls, I propose aesthetic local controls over the melodic trajectory of the generations, such as the amount of oscillation (gamaka) in the generated melody, the average direction of pitch motion (up or down), and the average dynamics. Introducing such controls opens up the possibility of a more creatively satisfying experience, leaving the performer in control whenever they want to be. Following prior work on controllability [16], I plan to evaluate this aspect with correlation metrics between the input attributes and the corresponding attributes extracted from the generated signal, as sketched below.
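As a concrete illustration of this evaluation, the following sketch computes a correlation metric using average pitch direction as the example local control; the helper names and the choice of Spearman rank correlation are illustrative assumptions rather than the exact protocol of [16].

```python
import numpy as np
from scipy.stats import spearmanr

def pitch_direction(f0_hz):
    """Average sign of frame-to-frame pitch motion (+1 rising, -1 falling); unvoiced frames are dropped."""
    voiced = np.asarray(f0_hz, dtype=float)
    voiced = voiced[voiced > 0]
    if voiced.size < 2:
        return 0.0
    return float(np.mean(np.sign(np.diff(np.log2(voiced)))))

def control_correlation(requested_values, generated_contours, extract=pitch_direction):
    """Rank correlation between the control value requested for each generation and
    the same attribute extracted back from the generated pitch contour."""
    extracted = [extract(f0) for f0 in generated_contours]
    rho, _ = spearmanr(requested_values, extracted)
    return float(rho)
```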

Interaction: Since this project addresses the paradigm of human-computer interactive generation, it is essential to consider the user experience with the system [17]. The Creativity Support Index (CSI) [18, 19, 20] measures a tool's ability to assist a user in creative work. It systematically scores six factors, namely exploration, expressiveness, immersion, enjoyment, results worth effort, and collaboration, through user surveys (a scoring sketch follows below). With the goal of enabling creative exploration, I plan to focus most carefully on exploration, immersion, and enjoyment, while keeping the other factors in mind where relevant. Additionally, interactions with professionals in the field will ensure that this work is relevant to the community of musicians who practice this style of music.
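For reference, here is a small sketch of how a single participant's CSI could be scored, following my reading of the procedure in [18] (two agreement statements per factor, plus 15 pairwise factor comparisons); the exact rating scale is an assumption, so consult [18] for the authoritative definition.

```python
CSI_FACTORS = ["Collaboration", "Enjoyment", "Exploration",
               "Expressiveness", "Immersion", "Results Worth Effort"]

def creativity_support_index(agreement, pair_counts):
    """Creativity Support Index (0-100) for one participant.

    agreement   : dict factor -> [rating1, rating2], the two agreement statements (1-10 each)
    pair_counts : dict factor -> how often the factor was chosen across the 15 pairwise comparisons
    """
    assert sum(pair_counts.values()) == 15, "all 15 factor pairs must be compared exactly once"
    weighted = sum(sum(agreement[f]) * pair_counts[f] for f in CSI_FACTORS)
    return weighted / 3.0  # maximum weighted sum is 20 * 15 = 300, so dividing by 3 scales to 100
```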

Interactive demos

This is a proof-of-concept demo of interactive waveform generation. In this experiment, I use a modified version of RAVE (a VAE with continuous latents) together with msprior. RAVE encodes the waveform into a more compact latent representation at a much lower frame rate. These latent codes are then fed into msprior, a transformer decoder that predicts the next codes autoregressively in real time. The data comes from the Hindustani Raga Recognition dataset, which consists of 116 hours of audio spanning 30 ragas and over 55 vocal artists. The right audio channel is my voice input and the left audio channel is the model responding to my voice. This is a work in progress; a minimal sketch of the underlying encode-predict-decode loop follows.
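The sketch below is a simplified offline version of that loop. It assumes an exported RAVE TorchScript model (the "rave.ts" path is hypothetical) exposing encode()/decode(), and uses a generic stand-in autoregressive prior rather than msprior's actual real-time streaming interface.

```python
import torch

# Load a hypothetical exported RAVE model; exported RAVE models expose encode()/decode().
rave = torch.jit.load("rave.ts").eval()

@torch.no_grad()
def continue_audio(voice, prior, n_steps):
    """voice: (1, 1, n_samples) waveform tensor at the model's sample rate."""
    z = rave.encode(voice)                 # (1, latent_dim, n_frames): compact latent sequence
    for _ in range(n_steps):
        nxt = prior(z)                     # stand-in prior: predict the next latent frame from the context
        z = torch.cat([z, nxt], dim=-1)    # append and keep generating autoregressively
    return rave.decode(z)                  # synthesize the continued latent sequence back to audio
```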

Artwork credit: craiyon.com

References

[1] van der Meer, W. Hindustani Music in the 20th Century. Martinus Nijhoff Publishers, 1980.
[2] Ganguli, K. K. "A Corpus-Based Approach to the Computational Modeling of Melody in Raga Music." PhD thesis, Indian Institute of Technology Bombay, 2019. https://www.ee.iitb.ac.in/student/~daplab/people/thesis/KKG_Thesis.pdf
[3] Das, D., and Chowdhury, M. "Finite State Models for Generation of Hindustani Classical Music." http://www.cs.cmu.edu/~dipanjan/pubs/frsm_gen.pdf
[4] Vidwans, V. Computational Music. https://computationalmusic.com/index.php
[5] "Automatic Music Generation of Indian Classical Music Based on Raga." IEEE conference publication, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10126388
[6] Viramgami, G., Gandhi, H., Naik, H., Mahajan, N., Venkatesh, P., Sahni, S., and Singh, M. "Indian Classical Music Synthesis." 5th Joint International Conference on Data Science & Management of Data (CODS-COMAD), 2022. https://doi.org/10.1145/3493700.3493762
[7] Wu, Y., et al. "MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling." International Conference on Learning Representations, 2022. https://openreview.net/forum?id=UseMOjWENv
[8] Wang, X., Takaki, S., and Yamagishi, J. "Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis." IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8):1406-1419, 2018. https://ieeexplore.ieee.org/document/8341752
[9] Engel, J., et al. "DDSP: Differentiable Digital Signal Processing." International Conference on Learning Representations, 2020. https://openreview.net/forum?id=B1x1ma4tDr
[10] Dieleman, S., et al. "The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale." 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018. https://arxiv.org/abs/1806.10474
[11] Caillon, A., and Esling, P. "RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis." arXiv:2111.05011, 2021. https://arxiv.org/pdf/2111.05011.pdf
[12] Zeghidour, N., et al. "SoundStream: An End-to-End Neural Audio Codec." arXiv:2107.03312, 2021. https://doi.org/10.48550/arXiv.2107.03312
[13] Défossez, A., et al. "High Fidelity Neural Audio Compression." arXiv:2210.13438, 2022. https://arxiv.org/abs/2210.13438
[14] Vaswani, A., et al. "Attention Is All You Need." 31st Conference on Neural Information Processing Systems (NIPS), 2017. https://arxiv.org/abs/1706.03762
[15] Caillon, A. MSPrior. GitHub repository, 2023. https://github.com/caillonantoine/msprior
[16] Devis, N., Demerlé, N., Nabi, S., Genova, D., and Esling, P. "Continuous Descriptor-Based Control for Deep Audio Synthesis." arXiv:2302.13542, 2023. https://arxiv.org/abs/2302.13542
[17] Huang, C.-Z., et al. "AI Song Contest: Human-AI Co-Creation in Songwriting." arXiv:2010.05388, 2020. https://arxiv.org/abs/2010.05388
[18] Cherry, E., and Latulipe, C. "Quantifying the Creativity Support of Digital Tools through the Creativity Support Index." ACM Transactions on Computer-Human Interaction, 21(4):1-25, 2014. https://doi.org/10.1145/2617588
[19] Louie, R., Coenen, A., Huang, C.-Z., Terry, M., and Cai, C. J. "Novice-AI Music Co-Creation via AI-Steering Tools for Deep Generative Models." Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-13, 2020. https://doi.org/10.1145/3313831.3376739
[20] Thelle, N. J. W. "Mixed-Initiative Music Making." PhD thesis, 2022. https://nmh.no/en/research/publications/mixed-initiative-music-making