# Spoken language identification with deep convolutional networks

Recently TopCoder announced a contest to identify the spoken language in audio recordings. I decided to test how well deep convolutional networks would perform on this kind of data. In short, I reached around 95% accuracy and finished in 10th place. This post covers all the details.

## Dataset and scoring

Each recording was in one of 176 languages. The training set consisted of 66176 mp3 files, 376 per language, from which I set aside 12320 recordings for validation (the Python script is available on GitHub). The test set consisted of 12320 mp3 files. All recordings had the same length (~10 sec) and seemed to be noise-free (at least all the samples that I checked).
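The post does not say how the validation files were picked. As a minimal sketch, assuming an equal hold-out of 70 files per language (12320 / 176 = 70), a per-language split could look like this. The function name and the synthetic file names are hypothetical.

```python
import random
from collections import defaultdict

def split_per_language(samples, n_val_per_language, seed=0):
    """samples: list of (filename, language). Returns (train, val) lists."""
    by_language = defaultdict(list)
    for filename, language in samples:
        by_language[language].append(filename)
    rng = random.Random(seed)
    train, val = [], []
    for language in sorted(by_language):
        files = sorted(by_language[language])
        rng.shuffle(files)
        # The first n files after shuffling go to validation, the rest to training.
        val += [(f, language) for f in files[:n_val_per_language]]
        train += [(f, language) for f in files[n_val_per_language:]]
    return train, val

# Synthetic file list with the sizes quoted above: 176 languages x 376 files.
samples = [("lang%03d_%03d.mp3" % (lang, i), lang)
           for lang in range(176) for i in range(376)]
train, val = split_per_language(samples, n_val_per_language=70)
print(len(train), len(val))  # 53856 12320
```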

The score was calculated as follows: for every mp3, the top 3 guesses were uploaded in a CSV file. 1000 points were given if the first guess was correct, 400 points if the second guess was correct, and 160 points if the third guess was correct. During the contest the score was calculated on only 3520 recordings from the test set. After the contest the final score was calculated on the remaining 8800 recordings.
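The scoring rule is simple enough to write down directly. The sketch below is my reading of the rule as described, not the organizers' scorer, and the language labels are made up for illustration.

```python
# Points awarded for a correct label at rank 1, 2 and 3, as described above.
POINTS = (1000, 400, 160)

def recording_score(guesses, truth):
    """Score one recording given its ordered top-3 guesses."""
    for points, guess in zip(POINTS, guesses):
        if guess == truth:
            return points
    return 0

def total_score(all_guesses, all_truths):
    """Sum the per-recording scores over a set of recordings."""
    return sum(recording_score(g, t) for g, t in zip(all_guesses, all_truths))

# Hypothetical language labels, for illustration only.
guesses = [["hy", "ka", "fa"], ["ka", "hy", "fa"], ["fa", "ka", "ru"]]
truths = ["hy", "hy", "hy"]
print(total_score(guesses, truths))  # 1000 + 400 + 0 = 1400
```

With 3520 recordings scored during the contest, the maximum provisional score is 3520 × 1000 = 3 520 000.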

## Preprocessing

I entered the contest just 14 days before the deadline, so I didn’t have much time to investigate audio-specific techniques. But we had a deep convolutional network developed a few months earlier, and it seemed like a good idea to test a pure CNN on this problem. A quick Google search revealed that the idea is not new. The earliest attempt I could find was a paper by G. Montavon presented at the NIPS 2009 conference. The author used a network with 3 convolutional layers trained on spectrograms of audio recordings, and the output of the convolutional/subsampling layers was fed to a time-delay neural network.

I found a Python script which creates a spectrogram of a wav file. I used the mpg123 library to convert the mp3 files to wav format.

The preprocessing script is available on GitHub.
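For readers who have not worked with spectrograms, here is a minimal sketch of the idea using a short-time FFT in NumPy. It is not the script I used: the frame length, hop size and log scaling are assumptions, and a synthetic tone stands in for a decoded recording.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Return a (freq_bins, frames) array of log-magnitudes via a short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (frames, frame_len // 2 + 1)
    return 20.0 * np.log10(magnitude + 1e-10).T

# One second of a synthetic 1 kHz tone stands in for a decoded recording.
rate = 16000
t = np.arange(rate) / rate
spec = log_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (257, 61)
```

Rows are frequency bins and columns are time frames, which is the image layout the network consumes.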

## Network architecture

I took the network architecture designed for Kaggle’s diabetic retinopathy detection contest. It has 6 convolutional layers and 2 fully connected layers with 50% dropout. The activation function is always ReLU. Learning rates are set higher for the first convolutional layers and lower for the top convolutional layers. The last fully connected layer has 176 neurons and is trained using a softmax loss.

It is important to note that this network does not take into account the sequential nature of audio data. Although recurrent networks perform well on speech recognition tasks (one notable example is this paper by A. Graves, A. Mohamed and G. Hinton, cited by 272 papers according to Google Scholar), I didn’t have time to learn how they work.

I trained the CNN in Caffe with 32 images per batch. Its description in Caffe prototxt format is available here.

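| Nr | Type | Batches | Channels | Width | Height | Kernel size / stride |
|----|------|---------|----------|-------|--------|----------------------|
| 0 | Input | 32 | 1 | 858 | 256 | |
| 1 | Conv | 32 | 32 | 852 | 250 | 7x7 / 1 |
| 2 | ReLU | 32 | 32 | 852 | 250 | |
| 3 | MaxPool | 32 | 32 | 426 | 125 | 3x3 / 2 |
| 4 | Conv | 32 | 64 | 422 | 121 | 5x5 / 1 |
| 5 | ReLU | 32 | 64 | 422 | 121 | |
| 6 | MaxPool | 32 | 64 | 211 | 60 | 3x3 / 2 |
| 7 | Conv | 32 | 64 | 209 | 58 | 3x3 / 1 |
| 8 | ReLU | 32 | 64 | 209 | 58 | |
| 9 | MaxPool | 32 | 64 | 104 | 29 | 3x3 / 2 |
| 10 | Conv | 32 | 128 | 102 | 27 | 3x3 / 1 |
| 11 | ReLU | 32 | 128 | 102 | 27 | |
| 12 | MaxPool | 32 | 128 | 51 | 13 | 3x3 / 2 |
| 13 | Conv | 32 | 128 | 49 | 11 | 3x3 / 1 |
| 14 | ReLU | 32 | 128 | 49 | 11 | |
| 15 | MaxPool | 32 | 128 | 24 | 5 | 3x3 / 2 |
| 16 | Conv | 32 | 256 | 22 | 3 | 3x3 / 1 |
| 17 | ReLU | 32 | 256 | 22 | 3 | |
| 18 | MaxPool | 32 | 256 | 11 | 1 | 3x3 / 2 |
| 19 | Fully connected | 20 | 1024 | | | |
| 20 | ReLU | 20 | 1024 | | | |
| 21 | Dropout | 20 | 1024 | | | |
| 22 | Fully connected | 20 | 1024 | | | |
| 23 | ReLU | 20 | 1024 | | | |
| 24 | Dropout | 20 | 1024 | | | |
| 25 | Fully connected | 20 | 176 | | | |
| 26 | Softmax Loss | 1 | 176 | | | |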
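The widths and heights in the table can be reproduced with a few lines of arithmetic. The sketch below assumes unpadded convolutions and pooling that rounds the output size up, which is how Caffe's pooling layer behaves; the helper names are mine.

```python
import math

def conv_out(size, kernel, stride=1):
    """Output size of an unpadded convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel=3, stride=2):
    """Output size of an unpadded pooling layer that rounds up."""
    return math.ceil((size - kernel) / stride) + 1

def feature_map_sizes(width, height, conv_kernels=(7, 5, 3, 3, 3, 3)):
    """Yield (width, height) after each conv + 3x3/2 max-pool pair."""
    for kernel in conv_kernels:
        width, height = conv_out(width, kernel), conv_out(height, kernel)
        width, height = pool_out(width), pool_out(height)
        yield width, height

print(list(feature_map_sizes(858, 256)))
# [(426, 125), (211, 60), (104, 29), (51, 13), (24, 5), (11, 1)]
```

The last pair, 11x1 with 256 channels, is what the first fully connected layer sees.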

Hrant suggested trying the ADADELTA solver. It is a method that dynamically calculates the learning rate for every network parameter, and the training process is said to be independent of the initial choice of learning rate. It was recently implemented in Caffe.
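To make the idea concrete, here is a small NumPy sketch of the ADADELTA update as I understand it from Zeiler's paper. It is not Caffe's code. The `lr` argument is an assumed extra multiplier standing in for a solver-level base learning rate; the original method has no such factor.

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6, lr=1.0):
    """One ADADELTA update. `state` holds two running averages.

    `lr` is an assumed extra multiplier on the step (not part of the
    original method), mimicking a solver-level base learning rate.
    """
    # Running average of squared gradients.
    state["avg_sq_grad"] = rho * state["avg_sq_grad"] + (1 - rho) * grad ** 2
    # Per-parameter step: ratio of the two RMS values times the gradient.
    step = -np.sqrt(state["avg_sq_step"] + eps) / np.sqrt(state["avg_sq_grad"] + eps) * grad
    # Running average of squared steps.
    state["avg_sq_step"] = rho * state["avg_sq_step"] + (1 - rho) * step ** 2
    return param + lr * step

# Minimise f(x) = x^2 elementwise from a toy starting point.
x = np.array([3.0, -2.0])
state = {"avg_sq_grad": np.zeros_like(x), "avg_sq_step": np.zeros_like(x)}
for _ in range(2000):
    x = adadelta_step(x, 2 * x, state)
print(np.abs(x))  # magnitudes shrink from the starting values
```

Each parameter gets its own effective step size from the two running averages, which is why the method is usually described as not needing a hand-tuned learning rate.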

In practice, the base learning rate set in the Caffe solver did matter. At first I tried a learning rate of 1.0, and the network didn’t learn at all. Setting the base learning rate to 0.01 helped a lot, and I trained the network for 90 000 iterations (more than 50 epochs). Then I switched to a base learning rate of 0.001 for another 60 000 iterations. The solver is available here. I am not sure why the base learning rate mattered so much in the early stages of training. One possible reason is the large learning rate coefficients on the lower convolutional layers. Both tricks (dynamically updating the learning rates in ADADELTA and large learning rate coefficients) aim to fight the vanishing gradient problem, and maybe combining them is not a very good idea. This should be carefully analysed.

Training (blue) and validation (red) loss over the 150 000 iterations on the non-augmented dataset. The sudden drop of training loss corresponds to the point when the base learning rate was changed from 0.01 to 0.001. Plotted using this script.

The signs of overfitting were becoming more and more visible, so I stopped at 150 000 iterations. The softmax loss reached 0.43, which corresponded to a score of 3 180 000 (out of a possible 3 520 000). Ensembling several snapshots of the same network gave a slightly higher score (3 220 000), but it was obvious that data augmentation was needed to overcome the overfitting problem.

## Data augmentation

The most important weakness of our team in the previous contest was that we didn’t augment the dataset well enough. So I was looking for ways to augment the set of spectrograms. One obvious idea was to crop random intervals of, say, 9 seconds from the recordings. Hrant suggested another idea: to warp the frequency axis of the spectrogram. This process is known as vocal tract length perturbation, and has been used for speaker normalization since at least 1998. In 2013 N. Jaitly and G. Hinton used this technique to augment an audio dataset. I used this formula to linearly scale the frequency bins during spectrogram generation:

Frequency warping formula from the paper by L. Lee and R. Rose. α is the scaling factor. Following Jaitly and Hinton, I chose it uniformly between 0.9 and 1.1.

I also randomly cropped the spectrograms to 768x256. Here are the results:

Spectrogram of one of the recordings, and a cropped spectrogram of the same recording with a warped frequency axis.
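The two augmentations can be sketched in a few lines of NumPy. Note the simplifications: this version warps an already computed spectrogram rather than warping during generation, and it uses plain linear scaling with edge clamping instead of the piecewise-linear formula from the paper. The function names are hypothetical.

```python
import numpy as np

def warp_frequency(spec, alpha):
    """Linearly rescale the frequency axis (rows) of a (freq, time) spectrogram.

    Content at bin f moves to roughly alpha * f. This is plain linear scaling
    with edge clamping, a simplification of the piecewise-linear formula.
    """
    bins = np.arange(spec.shape[0])
    source = bins / alpha  # position each output bin reads from
    return np.stack([np.interp(source, bins, column) for column in spec.T], axis=1)

def random_crop(spec, width, rng):
    """Cut a random window of `width` columns along the time axis."""
    start = rng.integers(0, spec.shape[1] - width + 1)
    return spec[:, start:start + width]

def augment(spec, rng, crop_width=768):
    """Warp with a random factor, then crop a random time window."""
    alpha = rng.uniform(0.9, 1.1)  # range quoted in the post
    return random_crop(warp_frequency(spec, alpha), crop_width, rng)

rng = np.random.default_rng(0)
spec = rng.random((256, 858))  # random stand-in for an 858x256 spectrogram
print(augment(spec, rng).shape)  # (256, 768)
```

Calling `augment` repeatedly on the same spectrogram yields the different random versions of each recording.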

For each mp3 I created 20 random spectrograms, but trained the network on only 10 of them. It took more than 2 days to create the augmented dataset and convert it to LevelDB format (the format Caffe suggests). But training the network proved to be even harder. For 3 days I couldn’t significantly decrease the training loss. After removing the dropout layers the loss started to decrease, but it would have taken weeks to reach reasonable levels. Finally, Hrant suggested reusing the weights of the model trained on the non-augmented dataset. The problem was that, due to the cropping, the image sizes in the two datasets were different. But it turned out that convolutional and pooling layers in Caffe work with images of variable sizes; only the fully connected layers couldn’t reuse the weights from the first model. So I just renamed the FC layers in the prototxt file and initialized the network (convolution filters) with the weights of the first model.
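Caffe copies pretrained weights into layers with matching names, which is why renaming the FC layers is enough. The toy sketch below illustrates that mechanism with plain NumPy dictionaries. It is not Caffe code, and the layer names and shapes are assumptions chosen for illustration.

```python
import numpy as np

def init_from_pretrained(new_layers, pretrained):
    """Copy weights for layers whose names match; leave the rest as initialised.

    Both arguments map layer name -> weight array. Hypothetical helper.
    """
    copied = []
    for name, weights in new_layers.items():
        if name in pretrained:
            if pretrained[name].shape != weights.shape:
                raise ValueError("shape mismatch for layer %r" % name)
            new_layers[name] = pretrained[name].copy()
            copied.append(name)
    return copied

rng = np.random.default_rng(0)
# Conv filter shapes do not depend on the input size; the first FC layer's does.
pretrained = {"conv1": rng.normal(size=(32, 1, 7, 7)),
              "fc1": rng.normal(size=(1024, 256 * 11 * 1))}
new_layers = {"conv1": np.zeros((32, 1, 7, 7)),
              "fc1_crop": rng.normal(size=(1024, 256 * 9 * 1))}  # renamed FC layer
print(init_from_pretrained(new_layers, pretrained))  # ['conv1']
```

Had the FC layer kept its old name, the shape check would have failed, which mirrors why the layers had to be renamed rather than reused.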

This helped a lot. I used standard stochastic gradient descent (with the inverse decay learning rate policy) and a base learning rate of 0.001 for 36 000 iterations (less than 2 epochs), then increased the base learning rate to 0.01 for another 48 000 iterations (under the inverse decay policy the rate seemed to have decreased too much). These runs used no regularization at all, neither weight decay nor dropout layers, and there were clear signs of overfitting. I tried adding 50% dropout to the fully connected layers, but training was extremely slow. To speed it up I used 30% dropout, and trained the network for 120 000 more iterations using this solver. The softmax loss on the validation set reached 0.21, which corresponded to a score of 3 390 000. The score was calculated by averaging the softmax outputs over the 20 spectrograms of each recording.
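For reference, Caffe's `inv` policy computes the rate as `base_lr * (1 + gamma * iter) ^ (-power)`. The sketch below shows how quickly it decays; the `gamma` and `power` values are illustrative assumptions, not the ones from my solver.

```python
def inv_learning_rate(base_lr, iteration, gamma, power):
    """Caffe's "inv" policy: base_lr * (1 + gamma * iter) ** (-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)

# gamma and power below are illustrative assumptions, not the post's solver values.
for iteration in (0, 12000, 36000):
    print(iteration, inv_learning_rate(0.001, iteration, gamma=0.0001, power=0.75))
```

The rate only ever decreases from `base_lr`, so restarting with a larger base rate is one way to undo a decay that has gone too far.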

## Ensembling

30 hours before the deadline I had several models from the same network. Even simple ensembling (just the sum of the softmax activations of different models) performed better than any individual model. Hrant suggested using XGBoost, which is a fast implementation of the gradient boosting algorithm and is very popular among Kagglers. XGBoost has good documentation and all parameters are well explained.

To perform the ensembling I created a CSV file containing the softmax activations (or the average of the softmax activations over the 20 augmented versions of the same recording) using this script. Then I ran XGBoost on these CSV files. The submission file (in the format requested by TopCoder) was generated using this script.
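The simple version of this pipeline (average over augmentations, sum over models, take the top 3) can be sketched as follows. The random arrays are stand-ins for real softmax outputs, and the function names are hypothetical.

```python
import numpy as np

def average_over_augmentations(probs):
    """probs: (recordings, augmentations, classes) -> (recordings, classes)."""
    return probs.mean(axis=1)

def top3(probs):
    """Indices of the three most probable classes per row, best first."""
    return np.argsort(-probs, axis=1)[:, :3]

rng = np.random.default_rng(0)
# Two hypothetical models, 5 recordings, 20 augmented spectrograms, 176 languages.
model_outputs = [rng.dirichlet(np.ones(176), size=(5, 20)) for _ in range(2)]
ensemble = sum(average_over_augmentations(p) for p in model_outputs)
print(top3(ensemble).shape)  # (5, 3)
```

The three indices per row map directly to the three guesses in the submission file.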

I also tried to train a simple neural network with one hidden layer on the same CSV files. The results were significantly better than with XGBoost.

The best result was obtained by ensembling the following two models: snapshots of the last network (the one with 30% dropout) after 90 000 iterations and 105 000 iterations. The final score was 3 401 840, which was the 10th best result of the contest.

## What we learned from this contest

This was quite an interesting contest, although too short compared with Kaggle’s contests.

• Plain, AlexNet-like convolutional networks work quite well for fixed length audio recordings
• Vocal tract length perturbation works well as an augmentation technique
• Caffe supports sharing weights between convolutional networks having different input sizes
• A neural network with a single hidden layer sometimes performs better than XGBoost for ensembling (although I had just one day to test both)

## Unexplored options

• It would be interesting to see whether a network with 50% dropout layers improves the accuracy
• Maybe larger convolutional networks, like OxfordNet, would perform better. They require much more memory, and it was risky to play with them under a tough deadline
• Hybrid methods combining CNN and Hidden Markov Models should work better
• We believe it is possible to squeeze more from these models with better ensembling methods
• Other contestants report better results based on careful mixing of the results of more traditional techniques, including n-gram and Gaussian Mixture Models. We believe the combination of these techniques with the deep models will provide very good results on this dataset

One important issue is that the organizers of this contest do not allow the dataset to be used outside the contest. We hope this decision will be changed eventually.