Hum2Song! is an AI-powered web application that composes a musical accompaniment for a melody produced by the human voice. Demo: https://www.carlostoxtli.com/hum2song/
2. Content
● Summary
○ Brief explanation of the results
● Demo
○ Show how Hum2Song works
● Detailed explanation
○ Explain my journey building it.
3. Summary - System description
Hum2Song! is an AI-powered web application that composes a musical accompaniment for a melody produced by a human voice.
5. Problems predicting genre from the melody
● Genre is an ambiguous concept
● e.g. "pop" music just means "popular", regardless of the actual genre
● Many songs combine different genres.
● Multitrack analysis is needed for genre prediction
● The same melody can be used in different genres
6. Results in the literature for genre prediction from MIDI
Cory McKay, Automatic Genre Classification of MIDI Recordings
7. Proposed method - 55.8%
After running 1,300 experiments (across 4 conditions), our best single-track, 1-D-feature model reached 55.8% validation accuracy (val_acc), outperforming previous work.
8. Best case
Layers: 128, 64, 32, 3
Input: 1-D vector of 128 features from drums
Output: 3 classes
Activation functions: ReLU & softmax
Optimizer: RMSprop
Loss function: Categorical cross-entropy
Val_acc: 55.8%
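A minimal Keras sketch of this configuration (illustrative only, not the project's exact training code; x_train/y_train are assumed to hold the 128-feature drum vectors and one-hot genre labels). The demo-case model on the next slide follows the same pattern with different layer sizes and a 64-feature melody input.

    from tensorflow import keras

    # Sketch of the best model: 128-feature drum vector in, 3 genre classes out
    model = keras.Sequential([
        keras.Input(shape=(128,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100)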
9. Layers: [64, 128, 16, 64, 256, 32, 3]
Input: 1-D vector of 64 features from melody
Output: 3 classes
Activation functions: ReLU & softmax
Optimizer: RMSprop
Loss function: Categorical cross-entropy
Val_acc: 48.6%
Case implemented in the demo
10. RMSprop
RMSprop was devised by the legendary Geoffrey Hinton, who proposed it informally during a Coursera class. It divides the learning rate for each weight by a running average of the magnitudes of recent gradients for that weight.
[Figure: Gradient Descent vs. RMSprop]
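In update-rule form, a plain-Python sketch of the idea (not Hinton's exact formulation; rho, lr, and eps here are common default values, not values from the slides):

    # RMSprop update for a single weight w with gradient g
    rho, lr, eps = 0.9, 0.001, 1e-8
    avg_sq_grad = 0.0                      # running average of squared gradients

    def rmsprop_step(w, g):
        global avg_sq_grad
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
        return w - lr * g / (avg_sq_grad ** 0.5 + eps)   # per-weight scaled step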
14. My journey - Starting point
● I decided to do it from scratch without consulting previous work.
● I had no domain knowledge (music theory)
● My main area of research is Human Computer Interaction.
● I had no experience building Web-AI apps.
● I only had ~1 month
● My main goal was to learn by trying and to have something to show in
my portfolio.
15. My journey - Steps to follow
● Implement an HTTPS site that allows voice recording
● Implement my model and the Google Magenta models
● Clean the noisy transcribed data
● Get the genre, a drum track, a bass track, a tonal scale, and a chord progression from the melody
● Create a song from the progressions
● Adapt a web music editor
● Publish the website
● Promote the online demo
● Learn how MIDI files are structured
● Scrape http://www.midiworld.com (16k files)
● Decide which features to use
● Preprocess the data
● Apply stratified sampling
● Evaluate several NN architecture combinations (325 per condition)
● Fine-tune the best options
● Convert the best model to TensorFlow.js (see the export sketch after this list)
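For the last step, one common route is the tensorflowjs Python package (a sketch under that assumption; model is the trained Keras model and the output path is a placeholder, not the project's actual layout):

    # Export a trained Keras model for use in the browser demo
    import tensorflowjs as tfjs

    tfjs.converters.save_keras_model(model, "web_model/")
    # web_model/ then holds model.json plus weight shards, which the page can
    # load in the browser with tf.loadLayersModel('web_model/model.json')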
16. Features
● The MIDI file format is a time series of notes; each note contains a pitch, a start time, and an end time.
● To convert the notes into a feature vector, a sample rate has to be defined. I used 64 samples (4 seconds) and 128 samples (8 seconds); see the sketch after this list.
● To extract a pattern that represents the main melody, 2 string algorithms were applied (learned in a String Algorithms class):
○ Longest Common Subsequence (LCS)
○ Longest Repeated Subsequence (LRS)
● Our 4 conditions were Melody 64 features, Melody 128 features, Drums 64 features, and Drums 128 features.
● For the melody conditions, we adapted the pitches to the human vocal range.
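A minimal sketch of the note-to-vector step under these assumptions (notes are plain (pitch, start, end) tuples with times in seconds; the helper name is illustrative, not the project's actual code):

    # Sample a note sequence into a fixed-length pitch vector
    def notes_to_vector(notes, n_samples=64, duration=4.0):
        step = duration / n_samples
        vector = [0] * n_samples                 # 0 = silence
        for i in range(n_samples):
            t = i * step
            for pitch, start, end in notes:
                if start <= t < end:
                    vector[i] = pitch            # pitch sounding at time t
                    break
        return vector

    # Example: a 2-second C4 (pitch 60) followed by a 2-second E4 (pitch 64)
    print(notes_to_vector([(60, 0.0, 2.0), (64, 2.0, 4.0)]))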
17. Choosing Neural Network Architecture
● To decide which architecture to use, all possible layer-size combinations drawn from [16, 32, 64, 128, 256] were tested.
● Each combination was trained for 100 epochs.
● Accuracy and confusion matrices were used to pick the best model.
● 4 NVIDIA Tesla K80 GPUs from Google Colaboratory were used.
● Keras checkpoints were used to preserve the best models (see the search sketch after this list).
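A minimal sketch of the architecture search under these assumptions (x_train/x_val and one-hot y_train/y_val are already-prepared 64-feature data; the depth range, checkpoint file names, and the val_accuracy key follow recent Keras conventions and are illustrative, not the project's exact script):

    import itertools
    from tensorflow import keras

    SIZES = [16, 32, 64, 128, 256]
    results = {}

    for depth in (1, 2, 3):                          # vary the number of hidden layers
        for combo in itertools.product(SIZES, repeat=depth):
            model = keras.Sequential()
            model.add(keras.Input(shape=(64,)))                      # 64-sample melody vector
            for units in combo:
                model.add(keras.layers.Dense(units, activation="relu"))
            model.add(keras.layers.Dense(3, activation="softmax"))   # 3 genre classes
            model.compile(optimizer="rmsprop",
                          loss="categorical_crossentropy",
                          metrics=["accuracy"])
            ckpt = keras.callbacks.ModelCheckpoint(
                f"best_{'-'.join(map(str, combo))}.h5",
                monitor="val_accuracy", save_best_only=True)
            hist = model.fit(x_train, y_train,
                             validation_data=(x_val, y_val),
                             epochs=100, verbose=0, callbacks=[ckpt])
            results[combo] = max(hist.history["val_accuracy"])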