MockingBot

Justin Masayda
5 min read · Sep 20, 2022

My Holberton School Portfolio Project

I’ve just completed my final assignment in Holberton School’s machine learning specialization. It’s been an intense 9 months. I knew virtually nothing about machine learning when I started, and now that I know something about it, I realize that I still know virtually nothing about machine learning compared to how much there is to know. But I learned enough to tackle an idea I had at the beginning of the program — an idea inspired by a problem.

The Problem

One of my favorite video games is No Man’s Sky, famous for its procedurally generated and practically endless universe. For all its creative possibilities, one aspect of the game that stood out to me was the sound of the fauna. The alien fauna have a wide range of appearances, but they essentially recycle the same sounds. It somewhat breaks the illusion of being in another world, reminding you it’s still a simulation. I wondered whether there was a way to generate audio as varied and fantastic as the alien creatures’ bodies.

The Solution

After studying machine learning for a few months, it seemed to offer a potential solution: I could design a model to generate audio and train it on some arbitrary animal sounds. The question was what kind of model to train. My first experience with generative models was building a variational autoencoder (VAE), so I decided to start there. But before I could do that, I needed training data.

Training Data

I figured the class of audio wasn’t particularly important, since I could change it and retrain the model later; I just needed something to work with. I already had a lot of musical instrument samples from doing music production, so I put together a folder of around 900 kick drum samples to train on. Kicks seemed like a good choice: they have a wide frequency spectrum and are relatively short, with most samples under a second long, which I expected would make training fairly quick.

The first thing I had to figure out was how to load them as TensorFlow tensors. This proved challenging, as TensorFlow has limited support for WAV files. I found a function in TensorFlow I/O that I expected to work, but TensorFlow I/O was not available on my MacBook Air (M1, 2020), so I switched over to developing in Google Colab. That allowed me to open some of my files, but I discovered that they were not all in the same WAV format. Apparently, not all WAV files are created equal: there are sub-formats of WAV, and TensorFlow I/O does not support all of the ways my files were encoded. In fact, a significant number of them wouldn’t load, so I needed new data that was formatted consistently.
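
For anyone attempting something similar, here’s a minimal sketch of this kind of loading pipeline using core TensorFlow’s tf.audio.decode_wav, which only handles 16-bit PCM WAV (one of the sub-format restrictions mentioned above). The folder path and the mono assumption are illustrative; the function mentioned above was from TensorFlow I/O, not this one.

import tensorflow as tf

def load_wav(path):
    """Read one WAV file and decode it into a float32 waveform tensor."""
    # tf.audio.decode_wav only supports 16-bit PCM WAV, so files in other
    # sub-formats will fail to load here.
    audio, sample_rate = tf.audio.decode_wav(tf.io.read_file(path))
    # Assumes mono files; stereo would need mixing down instead of squeezing.
    return tf.squeeze(audio, axis=-1), sample_rate

# "samples/*.wav" is a placeholder path, not the project's actual layout.
dataset = (
    tf.data.Dataset.list_files("samples/*.wav")
    .map(load_wav, num_parallel_calls=tf.data.AUTOTUNE)
)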

Fortunately, without much trouble, I found the CatMeows dataset, which became my training data. It’s formatted consistently, the samples are short but complex, and the sample rate is only 8 kHz, which helped reduce the computational load. It was also a collection of actual animal sounds, which better aligned with my original intent than kick drums.

Preprocessing

Though the data was already fairly consistent, I still had to decide whether to train on the raw audio or to extract features first. I knew of similar projects that trained on spectrograms, but my initial reaction was that this seemed like an extra step compared to training directly on the audio. After looking into it, however, I opted for spectrograms, since they have several advantages: they extract frequency information, they discard phase (which reduces complexity without much loss of quality), and they transform the audio from one dimension to two, which enabled the use of (de)convolutional neural networks.

I spent some time learning about the Fourier transform and used it as part of my input pipeline to create spectrograms from my WAV files. I also normalized the spectrograms so their values fall in the range [0, 1].
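
Here’s a minimal sketch of that preprocessing step built on tf.signal.stft; the frame length and frame step are illustrative placeholders rather than the exact values used in the project.

import tensorflow as tf

def to_spectrogram(waveform, frame_length=256, frame_step=128):
    """Turn a mono waveform into a normalized magnitude spectrogram."""
    # Short-time Fourier transform: one complex value per (frame, frequency bin).
    stft = tf.signal.stft(waveform, frame_length=frame_length, frame_step=frame_step)
    # Keep the magnitude and discard the phase.
    magnitude = tf.abs(stft)
    # Scale into [0, 1] so every example lives on the same range.
    magnitude /= tf.reduce_max(magnitude)
    # Add a channel axis so 2D (de)convolutions can treat it like an image.
    return magnitude[..., tf.newaxis]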

The Model

I started with a VAE, as I had built one before. However, after many rounds of training and experimenting with architectures and hyperparameters, it became apparent that a VAE would never generate anything too divergent from the training data, since it’s penalized for poor reconstructions of its input. Consequently, I switched to a GAN. The advantage of a GAN is that it isn’t necessarily penalized for creating something that isn’t in the dataset, so long as its output can pass for authentic training data. I had never built a GAN, so I spent some time studying them, but they aren’t terribly complicated in principle. The question was what architecture to use for the generator and the discriminator.
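
To make that trade-off concrete, here is the standard adversarial loss formulation (the textbook version, not necessarily the exact loss code from this project). The generator is only penalized when the discriminator catches it; nothing pulls it back toward reconstructing specific training examples.

import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # The discriminator is rewarded for scoring real spectrograms as 1
    # and generated spectrograms as 0.
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator is rewarded only for making the discriminator
    # score its output as real.
    return cross_entropy(tf.ones_like(fake_output), fake_output)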

The following diagram shows the architecture which gave me the best results:

MockingBot’s GAN architecture
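
For readers who want something concrete to start from, here is a generic Keras sketch of a convolutional spectrogram GAN. It is not the architecture from the diagram; the latent size, layer counts, filter sizes, and spectrogram shape are all placeholders.

import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 100          # placeholder size of the random noise vector
SPEC_SHAPE = (32, 32, 1)  # placeholder spectrogram shape

# Generator: noise vector -> upsampled spectrogram in [0, 1].
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(8 * 8 * 64),
    layers.LeakyReLU(),
    layers.Reshape((8, 8, 64)),
    layers.Conv2DTranspose(32, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    # Sigmoid keeps the output in the same [0, 1] range as the
    # normalized training spectrograms.
    layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="sigmoid"),
])

# Discriminator: spectrogram -> single logit (real vs. generated).
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=SPEC_SHAPE),
    layers.Conv2D(32, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    layers.Conv2D(64, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    layers.Flatten(),
    layers.Dense(1),
])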

Unfortunately, after dozens of experiments with architectures and hyperparameters, I never achieved the convincing realism I was striving for, but the process did convince me that a GAN was the right approach.

Proposed Improvements

I suspect that I could get better performance using a deconvolutional neural network, longer training, and a deeper network. More consistency in the training examples might also help: there’s still quite a bit of variation in the audio in terms of how loud the signal is, how much background noise is present, how much of each recording is silence, and so on. Additionally, since I discarded the phase of the signal when creating the spectrograms, the inverse Fourier transform has no phase information to work with, which introduces some distortion into the reconstructed audio. I don’t expect this to be the most significant factor in the quality of the output, but it certainly contributes.
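
For illustration, here’s a minimal sketch of what magnitude-only reconstruction looks like with tf.signal.inverse_stft, pretending the phase is zero everywhere; an iterative phase-recovery method such as Griffin-Lim is one way to reduce the distortion described above. The frame parameters must match whatever the forward STFT used.

import tensorflow as tf

def spectrogram_to_audio(magnitude, frame_length=256, frame_step=128):
    """Roughly invert a magnitude spectrogram of shape [frames, bins]."""
    # With no stored phase, treat every bin as having zero phase. This is
    # the source of the distortion described above.
    complex_spec = tf.cast(magnitude, tf.complex64)
    return tf.signal.inverse_stft(
        complex_spec,
        frame_length=frame_length,
        frame_step=frame_step,
        window_fn=tf.signal.inverse_stft_window_fn(frame_step),
    )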

What I Learned

I learned:

  • The limitations of VAEs.
  • What GANs are and what they’re capable of.
  • How the Fourier Transform and STFT work.
  • How to track experiments with TensorBoard.
  • That audio processing is difficult, but awesome.
  • Bonus: I learned that for a single oscillating signal, you can compute the exact frequency and amplitude at any point in the signal (instantaneous frequency/amplitude), even if they aren’t constant, without the Fourier Transform (see the sketch below). Not sure how useful that is, but it’s pretty cool.
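
Here’s one standard way to get that result, sketched for a pure sinusoid sampled at three consecutive points; the variable names and the 440 Hz test tone are just for illustration.

import numpy as np

def local_freq_amp(x_prev, x_curr, x_next, sample_rate):
    """Estimate the frequency (Hz) and amplitude of a pure sinusoid from three
    consecutive samples (valid when x_curr is nonzero and the frequency is
    strictly between 0 and the Nyquist frequency)."""
    # For x[n] = A*cos(w*n + p): cos(w) = (x[n-1] + x[n+1]) / (2*x[n]).
    cos_w = (x_prev + x_next) / (2 * x_curr)
    w = np.arccos(np.clip(cos_w, -1.0, 1.0))  # radians per sample
    # Once w is known, two adjacent samples pin down the amplitude.
    amp = np.sqrt(x_curr**2 + x_next**2 - 2 * x_curr * x_next * cos_w) / np.sin(w)
    freq_hz = w * sample_rate / (2 * np.pi)
    return freq_hz, amp

# Quick check on a synthetic 440 Hz tone sampled at 8 kHz.
sr, f, a = 8000, 440.0, 0.7
n = np.arange(3) + 100
x = a * np.cos(2 * np.pi * f / sr * n + 0.3)
print(local_freq_amp(x[0], x[1], x[2], sr))  # approximately (440.0, 0.7)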

The project can be seen on my GitHub:

https://github.com/keysmusician/MockingBot

