This series is inspired by my recent efforts to learn about applying deep learning in the music information retrieval field. I attended a two-week workshop on the same topic at Stanford’s Center for Computer Research in Music and Acoustics, which I highly recommend to anyone who is interested in AI + Music. This series is guided by Choi et al.’s 2018 paper, A Tutorial on Deep Learning for Music Information Retrieval.

It will consist of four parts: Background of Deep Learning, Audio Representations, Convolutional Neural Networks, and Recurrent Neural Networks. This series is intended as a beginner’s guide.

My intent in creating this series is to break this topic down into a bullet-point format for easy consumption.

Music Information Retrieval

Background of MIR

  • MIR usually deals with audio content, but it also extends to lyrics, music metadata, and user listening history

  • Audio can be complemented with cultural and social context, such as genre or era, to solve MIR problems

  • Lyric analysis is also MIR, but might not have much to do with audio content

Problems in MIR

Figure: MIR problems (Choi et al., 2018, p. 6)

Subjectivity

  • Definition: Absolute ground truth does not exist

  • Examples: music genres, the mood of a song, listening context, music tags

  • Counterexamples: pitch and tempo are more objectively defined, but sometimes still ambiguous

  • Why deep learning has achieved good results: it is difficult to manually design useful features when we cannot precisely describe the logic behind subjective judgments

Decision Time Scale

  • Definition: the unit length of time over which a prediction is made

  • Example:

    • Long decision time scale (time-invariant problems): tempo and key usually do not change within an excerpt

    • Short decision time scale (time-varying problems): melody extraction is usually performed on very short time frames

Audio Data Representations

  • General background: audio signals are 1D, while time-frequency representations are 2D and come in several variants (listed below)

  • It is important to pre-process the data and choose an effective representation of the audio to save computational cost

  • 2D representations can be considered as images but there are differences between images and time-frequency representations

    • Images are usually locally correlated (meaning nearby pixels will have similar intensities and colors)

    • But correlations in spectrograms are often harmonic: correlated components can be far apart along the frequency axis, and local correlations are weaker

    • Scale invariance is expected for visual object recognition but probably not for audio-related tasks.

      • Here scale invariance means if you enlarge a picture of a cat, the model will still be able to recognize that it’s a cat.

Audio Signals

  • Raw samples of the digital audio signal

  • Usually not the most popular choice, given the sheer amount of data and the resulting computational cost

    • The sampling rate is usually 44,100 Hz, which means one second of audio contains 44,100 samples

  • But recently, one-dimensional convolutions have been used to learn alternatives to existing time-frequency conversions
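
A quick sketch of loading raw samples in Python with librosa (my addition, not from the paper); the file name is a hypothetical placeholder:

    import librosa

    # Load audio at its native sampling rate (sr=None keeps the file's own rate)
    y, sr = librosa.load("song.wav", sr=None)  # "song.wav" is a placeholder path
    print(y.shape, sr)  # e.g. (1323000,) 44100 for 30 seconds at 44.1 kHz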

Short Time Fourier Transform

  • Definition: The procedure for computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment.

  • (+) Faster to compute than other time-frequency representations thanks to the FFT

  • (+) Invertible back to the audio signal, so it can be used for sonification of learned features and for source separation

  • (-) Its frequency bins are linearly spaced, which does not match the frequency resolution of the human auditory system; the mel spectrogram addresses this

  • (-) Not as compact as the mel spectrogram (log scale), but also not as raw as audio signals

  • (-) Not musically motivated the way the CQT is
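
A minimal STFT sketch with librosa, reusing y and sr from the loading example above; n_fft=2048 and hop_length=512 are common defaults, not values prescribed by the paper:

    import numpy as np
    import librosa

    # Complex STFT: shape (1 + n_fft // 2, n_frames) = (1025, n_frames) here
    D = librosa.stft(y, n_fft=2048, hop_length=512)

    # Log-magnitude spectrogram, the usual 2D input for models and plots
    S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

    # Invertibility: reconstruct the waveform from the complex STFT
    y_hat = librosa.istft(D, hop_length=512)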

Mel Spectrogram

  • 2D representation that is optimized for human auditory perception.

  • (+) It compresses the STFT along the frequency axis to match the logarithmic frequency scale of human hearing, making it efficient in size while preserving the most perceptually important information

  • (-) not invertible to audio signals

  • Popular for tagging, boundary detection, onset detection, and learning latent features for music recommendation because it closely matches human auditory perception
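
A mel spectrogram sketch in the same spirit, again reusing y and sr; n_mels=128 is a common choice, not a requirement:

    import numpy as np
    import librosa

    # Power mel spectrogram: the STFT compressed into 128 mel-spaced bands
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, n_mels=128)

    # Log (dB) scaling is typically applied before feeding it to a network
    S_db = librosa.power_to_db(S, ref=np.max)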

Constant-Q Transform (CQT)

  • Definition: Also a 2D time-frequency representation, with logarithmically spaced center frequencies that perfectly match the frequency distribution of musical pitches

  • (+) Perfectly matches the pitch frequency distribution, so it should be used where the fundamental frequencies of notes need to be identified

    • Example: chord recognition and transcription

  • (-) Computationally heavier than the other two representations
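
A CQT sketch; 84 bins at 12 bins per octave (7 octaves up from C1) are illustrative settings, not requirements:

    import numpy as np
    import librosa

    # 12 bins per octave -> one bin per semitone, aligned with musical pitch
    C = librosa.cqt(y, sr=sr, hop_length=512,
                    fmin=librosa.note_to_hz("C1"),
                    n_bins=84, bins_per_octave=12)
    C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)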

Chromagram

  • Definition: A pitch class profile; it provides the energy distribution over the 12 pitch classes (from C to B)

  • It is more heavily processed than the other representations and can be used as a feature by itself, much like MFCCs
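
A chromagram sketch to close out; librosa’s chroma_cqt folds the CQT bins into the 12 pitch classes:

    import librosa

    # 12 rows, one per pitch class (C, C#, ..., B); columns are time frames
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)
    print(chroma.shape)  # (12, n_frames)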