STEM I

STEM is taught by Dr. Kevin Crowthers. In STEM I, we conduct a six-month-long research project, learning how to read and understand research papers, practicing formal writing, and, of course, exercising experimentation and creativity.

Using Machine Learning Techniques to Improve Computer-Based Assessment of Musical Performances

Abstract

Current technology for assessing musical performance does not take into account factors like tone quality and the limitations of the instrument being played. The goal of this project is to develop a machine learning algorithm that can determine which instrument the user is playing and whether their playing matches the style and tone of the piece. Machine learning algorithms can be applied to audio spectrograms, which can give users tailored information on more human aspects of performance, such as tone quality, technique, and texture (which arises from how the instrument is played and the physical design of the instrument). With the rise of technology, computer programs and applications have been created to analyze a user's musical performance. The usage of these apps became increasingly popular during the COVID-19 pandemic, when musicians had limited access to travel and in-person lessons with teachers. Music teachers also experienced increased burnout during the pandemic, making them less available to students (Sarikaya, 2021). Sarikaya's results were consistent regardless of age group, showing that this is a problem for a diverse range of teachers and, consequently, students. In this project, audio data is recorded and then converted into a spectrogram using Python. The spectrogram is the input to the machine learning algorithm, which aims to detect specific components of tone. The output is compared to the ideal result, and the model then adjusts itself to improve over time. The network is a convolutional neural network trained to detect borders, shapes, and general patterns. The adapted model had an accuracy of 81% with 1,000 training samples.

Graphical Abstract

Research Proposal

Phrase 1

Existing technologies for analyzing musical performances only assess frequency and rhythm; they do not take into account more human aspects of music such as tone and musical technique.

Phrase 2

The goal of the research is to develop an algorithm that can differentiate between different types of tone and report to the user, in an understandable way, how closely their playing matched the ideal tone and where they made their mistakes.

Background (Infographic + Written Description)

Music is deeply embedded in cultures around the world and has brought people joy for tens of thousands of years. Most people in modern society have played music at some point, and some musicians dedicate their lives to the practice, constantly looking for ways to hone their skills. One essential part of becoming a better musician is receiving quality feedback on one's performances. Musicians typically look for criticism regarding tone (or, more generally, timbre), rhythm, articulation, intonation, and style. Historically, aspiring musicians have sought the help of a teacher or mentor who is an expert in the performer's instrument. Such mentors provide useful advice because humans have an intuitive understanding of musical quality (e.g., the difference between a "good" and "bad" sound, how different articulations sound, what emotions the performer is trying to convey). However, with the rise of technology, computer programs and applications have been created to analyze a user's musical performance. The usage of these apps became increasingly popular during the COVID-19 pandemic, when musicians had limited access to travel and in-person lessons with teachers. Music teachers also experienced increased burnout during the pandemic, making them less available to students (Sarikaya, 2021). The main problem with these programs is that they only detect the note accuracy and rhythm of the performance and return a score based on those two metrics. As a consequence, many musicians feel they do not get adequate feedback from such programs.
To improve a computer's analysis of musical performance, we aim to improve two aspects: recognition of the different features of music, and understanding of how important each feature is for the piece being played. The first major gap in knowledge exists in detecting different instruments, which is essential for assessing technique during performances. For example, a staccato passage (short notes with brief pauses) sounds different on a flute and a violin: the flute's sound flows more, while the violin's sounds choppier. Since modern musical analyzers only detect rhythm and note accuracy, they cannot make this distinction. To solve this problem, we can analyze the frequencies emitted by different instruments when playing the same note. When a particular note is played on a given instrument, most of the audio's energy lies at or near the frequency of the note, but there are also "other frequencies that give each instrument its particular qualities" (Cuff, 2016). Therefore, a major goal of this project is creating an algorithm that can recognize these characteristic frequencies for each instrument. Generating the constituent frequencies from the time-domain representation of the raw audio is not difficult: one can convert the time-domain representation into a frequency-domain representation using a Fast Fourier Transform (FFT). My strategy is to identify the frequencies with the highest magnitude that are not within a certain range of the desired frequency (the note); as with any machine learning algorithm, the mapping from these frequencies to instruments improves over time.
A second major flaw in musical analyzers is their inability to recognize different musical techniques. For example, modern programs cannot detect a crescendo or recognize the difference between pizzicato (plucking a violin string) and arco (drawing a bow across the string). Detecting these techniques requires recognizing gaps between notes and changes in the dynamics of certain frequencies at given points in time. To solve these problems, we can transform the time-domain representation of the raw audio into a 2D spectrogram, which gives the frequency content and magnitude at any given time. We can then use image-processing software, with weights and biases tuned to recognize gaps (for staccato vs. legato) and changes in brightness, i.e. magnitude (for dynamic changes). A similar method has proven effective in the past for recognizing different sounds (Thornton et al., 2019).
Lastly, computers have not yet been able to determine which elements are more important than others for performing a specific piece of music. While they do have the score for any given piece, they do not change their algorithm depending on which piece is being played. My project intends, for the first time, to prioritize certain elements of music over others and change its algorithm accordingly. This means that for rhythmically complex pieces, intonation and tone are weighted less heavily than they would be in a slow, rhythmically simpler piece. Developing such an algorithm would also allow the individual playing styles of performers to shine through. In a piece where the style is "aggressive", it might be important to exaggerate sforzandos/accents (sudden loud notes) and pauses between notes.
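As a rough illustration of the FFT-based idea described above, the sketch below (using NumPy) picks out the strongest frequency components away from the written note. The function name find_overtone_peaks, the exclusion band, and the synthetic test tone are illustrative choices, not part of the project's actual code.

```python
import numpy as np

def find_overtone_peaks(samples, sample_rate, target_hz, exclude_hz=30.0, top_k=5):
    """Return the strongest frequency components away from the target note.

    samples: 1-D NumPy array of mono audio samples.
    target_hz: the written pitch of the note being played.
    exclude_hz: half-width of the band around target_hz to ignore.
    """
    # Real FFT gives the magnitude of each frequency bin.
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    # Mask out bins close to the note itself, keeping the "other frequencies"
    # that characterize the instrument's timbre.
    mask = np.abs(freqs - target_hz) > exclude_hz
    candidate_freqs = freqs[mask]
    candidate_mags = spectrum[mask]

    # Pick the top_k highest-magnitude remaining bins.
    order = np.argsort(candidate_mags)[::-1][:top_k]
    return list(zip(candidate_freqs[order], candidate_mags[order]))

# Example: a synthetic tone at A4 (440 Hz) with two overtones.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = (np.sin(2 * np.pi * 440 * t)
        + 0.5 * np.sin(2 * np.pi * 880 * t)
        + 0.2 * np.sin(2 * np.pi * 1320 * t))
print(find_overtone_peaks(tone, sr, target_hz=440))
```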
It can even be applied to recordings of older pieces, where there is a clear distinction in the "manner of execution" of certain rhythms between historically accurate performances (musicians who want to emulate how the piece was performed in the period it was written) and modernized performances (performances aimed at today's audiences) (Liebman et al., 2012). One approach to weighting rhythm against the other elements is to count the number of significant rhythm changes and then train a machine learning algorithm on recordings of the user playing the piece to identify the spots (i.e., time slots) where the user is having trouble with rhythm.
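A hedged sketch of how such rhythm trouble spots could be flagged is shown below. It assumes librosa is available and that the student and reference recordings start at the same moment; the tolerance value and function name are illustrative.

```python
import numpy as np
import librosa

def rhythm_trouble_spots(student_path, reference_path, tolerance_s=0.08):
    """Flag reference note onsets the student misses by more than tolerance_s seconds."""
    y_ref, sr_ref = librosa.load(reference_path, sr=None)
    y_stu, sr_stu = librosa.load(student_path, sr=None)

    # Detect note onsets (in seconds) in both recordings.
    ref_onsets = librosa.onset.onset_detect(y=y_ref, sr=sr_ref, units="time")
    stu_onsets = librosa.onset.onset_detect(y=y_stu, sr=sr_stu, units="time")

    trouble = []
    for t_ref in ref_onsets:
        # Distance to the closest student onset; a large gap suggests a rhythm problem here.
        error = np.min(np.abs(stu_onsets - t_ref)) if len(stu_onsets) else np.inf
        if error > tolerance_s:
            trouble.append((float(t_ref), float(error)))
    return trouble  # list of (time in reference, timing error in seconds)
```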

Procedure (Infographic + Written Description)

Current technology for assessing musical performance does not take into account factors like tone quality and the limitations of the instrument being played. The goal is to develop a machine learning algorithm that can determine which instrument the user is playing and whether their playing matches the style and tone of the piece. Machine learning algorithms can be applied to audio spectrograms, which can give users tailored information on more human aspects of performance, such as tone quality, technique, and texture (which arises from how the instrument is played and the physical design of the instrument). The audio is recorded and then converted into a spectrogram using Python. The spectrogram is the input to the machine learning algorithm, which aims to detect specific components of tone. The output is compared to the ideal result, and the model then adjusts itself to improve over time. The network is a convolutional neural network trained to detect borders, shapes, and general patterns.
Recognizing quality of tone and musical technique primarily consists of extracting patterns, specifically shapes, from spectrograms. To determine the qualities that the algorithm must detect, the spectrograms were first inspected manually. For example, in samples containing vibrato (a musical technique in which the frequency of the note oscillates slightly), the overtones of the primary frequency look like a sine wave, whereas in recordings without vibrato the overtones are flat. To extract the borders of the relevant portions, which show up as bright regions on a spectrogram, a convolutional neural network is run on the image. The kernel is trained with different sizes and different values inside the matrix in order to balance runtime and accuracy. The extracted pixels are then mapped as a function of time, and this function is compared to an ideal, "perfect" playing scenario. These perfect scenarios are sourced from YouTube recordings and other online audio- and video-sharing platforms. The comparison of the experimental function against the theoretical perfect function returns an accuracy rating between 0 and 1. The same process is repeated for a host of other factors, such as tone quality, intonation, the note envelope (how the note starts and stops), and more. In the end, the process returns a vector containing all of the accuracy comparison values (again, all numbers between 0 and 1).
Lastly, the computer looks at the piece provided by the user. By examining certain features of the music (e.g., the number of notes, the rhythms, the time period of the piece), the algorithm assigns an importance level to each metric (such as tone, intonation, and vibrato). For example, a piece with many notes would probably give more weight to rhythm and intonation than to tone quality, as it is harder to produce a quality tone when playing fast. Finally, this weighted analysis is returned to the user in a user-friendly form that tells them where they made mistakes and what specific changes to make to improve.
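The sketch below illustrates this spectrogram-plus-CNN pipeline in simplified form, assuming SciPy and PyTorch. The fixed 128x128 spectrogram size, the two-layer network, and the single sigmoid output (one accuracy value in [0, 1]) are illustrative choices rather than the exact model used in this project.

```python
import numpy as np
from scipy.signal import spectrogram
import torch
import torch.nn as nn

def audio_to_spectrogram(samples, sample_rate, n_freq=128, n_time=128):
    """Convert raw audio to a fixed-size log-magnitude spectrogram image."""
    freqs, times, sxx = spectrogram(samples, fs=sample_rate, nperseg=512, noverlap=256)
    log_sxx = np.log1p(sxx)
    # Crop/pad to a fixed size so every clip produces the same input shape.
    image = np.zeros((n_freq, n_time), dtype=np.float32)
    f, t = min(n_freq, log_sxx.shape[0]), min(n_time, log_sxx.shape[1])
    image[:f, :t] = log_sxx[:f, :t]
    return image

class ToneCNN(nn.Module):
    """Small convolutional network scoring one tone quality (e.g. vibrato present)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 32 * 32, 1),
            nn.Sigmoid(),  # output in [0, 1], matching the accuracy-vector convention
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example forward pass on one fake one-second clip.
sr = 22050
clip = np.random.randn(sr).astype(np.float32)
spec = audio_to_spectrogram(clip, sr)
model = ToneCNN()
score = model(torch.from_numpy(spec)[None, None])  # input shape (1, 1, 128, 128)
print(float(score))
```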

Below are some images generated throughout the development of this project:



Analysis

The research presented offers a leap in the way musicians can analyze their playing. The primary difference between the model presented here and state-of-the-art technology is the information my model provides about quality of sound (as perceived by the human ear). This includes tone quality, musical technique, and the timbre of the instrument (in particular, instrument identification). Commercial models, such as SmartMusic and Flowkey, observe only information regarding note accuracy and rhythm/tempo. They achieve this by transforming the raw audio waveform into its constituent frequencies and retrieving the primary, highest-amplitude frequency. While efficient, this method ignores a significant amount of data necessary for understanding the quality and character of the sound: a note could be played scratchily, or on the wrong instrument, and the system would treat it just the same. My model incorporates a more effective method of musical analysis that parses patterns in the frequency data that current systems miss. Tone quality does not emerge from the single strongest frequency at each moment; it emerges from groups of high-amplitude frequencies over time. For example, recordings with and without a clean note envelope can be distinguished by the number of small gaps in the frequencies surrounding the main frequency. Qualities such as dynamics are also missed by current models. I addressed this by considering smaller time slots and correlating the amplitude of the main frequency (and the frequencies around it) with the dynamics the piece calls for.
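A small sketch of this dynamics check, under the assumption that the "ideal" loudness curve is available as an array, might look like the following; the band width, window sizes, and function name are illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

def dynamics_similarity(samples, sample_rate, ideal_envelope, target_hz, band_hz=50.0):
    """Compare the player's loudness curve near the main note to an ideal loudness curve.

    ideal_envelope: 1-D array describing how loud each time slot *should* be
    (e.g. rising values for a crescendo). The return value is a Pearson
    correlation, so 1.0 means the dynamic shape matches perfectly.
    """
    freqs, times, sxx = spectrogram(samples, fs=sample_rate, nperseg=1024, noverlap=512)

    # Sum the energy in the band around the main frequency for each time slot.
    band = np.abs(freqs - target_hz) <= band_hz
    envelope = sxx[band].sum(axis=0)

    # Resample the ideal envelope onto the same number of time slots.
    ideal = np.interp(np.linspace(0, 1, len(envelope)),
                      np.linspace(0, 1, len(ideal_envelope)), ideal_envelope)

    # Correlation between the played dynamics and the intended dynamics.
    return float(np.corrcoef(envelope, ideal)[0, 1])
```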

Discussion and Conclusion

Methods such as these have proven effective in the past. In 2019, B. Z. Leitner and S. Thornton tested the performance of a machine learning model built by editing the layers of a previous model. A machine learning model is built in layers: each group of neurons (cells that hold numbers) has a set of operations, defined by weights and biases, that maps it to the next layer. The neurons in the middle have no direct meaning to humans, but they can eventually be traced to some vague meaning based on what we want the model to do. Changing the core layers of a machine learning model, a few layers after the input, fundamentally changes the model's inner workings; it is like replacing a human brain with the brain of a different species. Changing only the last few layers, however, yields a similar model that can be specialized for certain cases (in our case, music); this is more like replacing the brain of a human painter with that of a human musician. Leitner and Thornton did exactly that, creating three machine learning models: one built from the ground up, an edited version of an image-processing algorithm (with roughly the final 10 layers changed), and an unedited version of the image-processing algorithm. The image-processing algorithm was specialized for spectrograms to classify different types of audio. The edited image-processing algorithm had an accuracy rate of 88.5%, very close to the tailor-made ground-up model's 88.9%. (The control, the unedited image-processing algorithm, had an accuracy rate of 82.9%, significantly worse than the other two models.) My project builds on a similar approach and additionally uses shape-detection algorithms on the images (spectrograms, which encode time, frequency, and magnitude) so that the computer understands what it is analyzing.
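The sketch below shows the general transfer-learning idea of freezing a pretrained image network and replacing its final layer, assuming PyTorch and a recent torchvision; the ResNet-18 backbone and the four illustrative audio classes are my assumptions, not the exact model Leitner and Thornton or this project used.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_AUDIO_CLASSES = 4  # illustrative: e.g. clean tone, scratchy tone, vibrato, no vibrato

# Start from an image-recognition network trained on ordinary photographs
# (downloads pretrained weights on first use).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the early layers: the generic edge/shape detectors stay as they are.
for param in model.parameters():
    param.requires_grad = False

# Replace only the final classification layer, the "last few layers" idea:
# the network keeps its general visual knowledge but now outputs audio classes.
model.fc = nn.Linear(model.fc.in_features, NUM_AUDIO_CLASSES)

# Only the new layer's weights are updated during fine-tuning on spectrograms.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# ResNet expects 3-channel images, so a 1-channel spectrogram is repeated 3 times.
fake_spectrogram = torch.randn(1, 1, 224, 224).repeat(1, 3, 1, 1)
logits = model(fake_spectrogram)
print(logits.shape)  # torch.Size([1, 4])
```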

References

February Fair Poster