Harmony in Complexity:
How the Human-Music Connection Can Propel the Future of Machine Learning

Kelland Harrison

     Have you ever wondered what your life would be like if you were a different person? If you were a Renaissance-era Italian, an ancient Egyptian, or a Polynesian settling the Pacific, how would you spend your time? Your beliefs, values, and goals would all be different. What would remain the same? Across all human cultures, there are some constants. On one hand, we have conflict, struggle, and war. On the other hand, we have compassion, love, and music. The modern sentiment is to regulate the former category, to use competition as a tool to better the human condition. In contrast, music and compassion are increasingly celebrated.    

          The universality of music originates in evolution. The development of language and that of music are thought to be entwined, though the exact nature of this relationship is under debate: some researchers argue that music perception preceded complex language, while others argue the reverse.3 Perhaps the more holistic stance is that language and music co-evolved.

          While there are many similarities between the two, music and language processing are distinct capacities. The part of our brains that processes language is not used to process music: scans of the language system show insignificant activity when subjects listen to music.2 Furthermore, language impairment has been shown to have little effect on music perception. Interestingly, music impairment can affect a person’s aptitude with tonal languages. Nevertheless, we observe that the brain’s music and language processing systems are largely independent. Currently, generative language models are at the forefront of AI. These models work on text, not audio. While this limitation prevents the models from perceiving music, it also limits their capacity to understand language: in speech, much of the information is conveyed through intonation and phrasing, and a text-based language model is blind to all of it. In humans, the ability to recognize these audio-only cues is connected to music perception. Studies have linked impaired music perception with difficulty recognizing emotion in speech, and a positive correlation has also been found: musicians tend to be better at recognizing emotions in speech. This connection indicates that music perception could be a valuable ability for audio-processing language models.

          Audio is a difficult medium to analyze. Compared with text, audio is noisier: multiple people may speak at the same time, more kinds of information are present, and the listener may be interested in any combination of them. Different AI models have been developed to handle different aspects of audio processing: convolutional neural networks have been used to isolate voices6, and transformer networks have been used to generate audio. These architectures come from image and text processing. Considering the unique challenges of audio, we may be able to improve audio models by taking inspiration from the brain’s auditory system. We have seen that the auditory system has multiple branches: language and music processing. Since music is a purely auditory phenomenon, the music perception system may hold some undiscovered gems for audio models. One potential inspiration is beat synchronization. When we listen to music, we often nod our heads or tap our fingers to the beat. This synchronization is anticipatory: we must predict the next beat in order to synchronize our movements with it. Predictive coding has been suggested as the underlying mechanism.
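The anticipatory nature of beat synchronization can be sketched in a few lines. This is a deliberately minimal illustration, not a real beat-tracking model: it assumes we already have the onset times of past beats and simply extrapolates from the average inter-beat interval. The function name and interface are hypothetical.

```python
def predict_next_beat(onsets):
    """Predict the time of the next beat from past onset times (seconds).

    Minimal sketch: assumes a steady tempo and extrapolates using the
    mean inter-beat interval. Real beat trackers handle tempo drift,
    syncopation, and noisy onsets.
    """
    # Inter-beat intervals between consecutive onsets
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    avg_interval = sum(intervals) / len(intervals)
    # Anticipate: the next beat is one average interval after the last one
    return onsets[-1] + avg_interval


# A steady 120 BPM pulse (one beat every 0.5 s)
print(predict_next_beat([0.0, 0.5, 1.0, 1.5]))  # → 2.0
```

The key point mirrors the text: to tap in time, the system must commit to a prediction of a beat that has not happened yet, rather than react to one that has.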


              Predictive coding is a general theory of brain function. The idea is that the brain is always trying to predict its own future state, and the resulting prediction error is used to update the brain’s predictive model. These two steps, prediction and error-driven update, are familiar to anyone who has trained a neural network with backpropagation. One difference is that backpropagation separates the training and operation of a model, while predictive coding supposes that training is integrated into a model’s operation. Integrating training and operation would allow models to continue learning after deployment. When we release AI models into the real world, they will encounter situations that we never trained them for. If models were able to improve in situ, they might learn to solve these novel problems. Reconsider the constants of the human condition: conflict, compassion, and music. Each of these presents challenges that evolve in real time. Conflict and compassion require updating predictive models of yourself and other people. Music perception creates and updates a predictive model of sound. The similarities are clear. Perhaps someday AI models designed to process music will help us understand other aspects of the human condition.
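The two-step loop described above, predict then update from the error, can be sketched as a toy online learner. This is an illustrative assumption, not a neuroscientific model: a single linear weight predicting the next signal value, updated on every observation so that "training" never stops during operation.

```python
class PredictiveCoder:
    """Toy predictive-coding loop: predict, measure error, update.

    Sketch only: one linear weight predicting the next value of a
    signal from the current one. Learning is folded into operation,
    so the model keeps adapting on every sample it processes.
    """

    def __init__(self, lr=0.1):
        self.w = 0.0   # predictive model: next ≈ w * current
        self.lr = lr   # learning rate

    def step(self, current, actual_next):
        prediction = self.w * current        # 1. predict the future state
        error = actual_next - prediction     # 2. compute the prediction error
        self.w += self.lr * error * current  # 3. update the model from the error
        return error


# The true signal relationship is next = 0.5 * current; the model
# discovers this online, with no separate training phase.
pc = PredictiveCoder(lr=0.1)
for _ in range(200):
    pc.step(1.0, 0.5)
print(round(pc.w, 3))  # → 0.5
```

Unlike a network trained once with backpropagation and then frozen, this loop would keep adjusting `w` if the signal's statistics changed after deployment, which is the in-situ learning the paragraph describes.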