Combining CNN and RNN for spoken language identification
By Hrayr Harutyunyan and Hrant Khachatrian
Last year Hrayr used convolutional networks to identify spoken language from short audio recordings for a TopCoder contest and got 95% accuracy. After the end of the contest we decided to try recurrent neural networks and their combinations with CNNs on the same task. The best combination reached 99.24% accuracy, and an ensemble of 33 models reached 99.67%. This work became Hrayr’s bachelor’s thesis.
Inputs and outputs
As before, the inputs of the networks are spectrograms of speech recordings. It seems spectrograms are the standard way to represent audio for deep learning systems (see “Listen, Attend and Spell” and “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”).
Some networks use frequencies up to 11 kHz (an 858 x 256 image) and others only up to 5.5 kHz (an 858 x 128 image). In general the networks that use only up to 5.5 kHz perform slightly better, probably because the higher frequencies do not contain much useful information and just make overfitting easier.
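For illustration, a log-spectrogram of roughly this shape can be computed with scipy; the exact STFT parameters of our preprocessing are not spelled out in this post, so the window length and overlap below are just illustrative values.

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(samples, sample_rate, nperseg=512, noverlap=256):
    # STFT magnitude on a log scale; with a 512-sample window there are
    # 257 frequency bins, i.e. roughly the 256 rows mentioned above.
    freqs, times, sxx = spectrogram(samples, fs=sample_rate,
                                    nperseg=nperseg, noverlap=noverlap)
    return freqs, times, np.log(sxx + 1e-10)

# Keeping only the bins below 5.5 kHz halves the image height (256 -> 128 rows):
# freqs, times, spec = log_spectrogram(samples, 22050)
# spec_5khz = spec[freqs <= 5500, :]
```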
The output layer of all networks is a fully connected softmax layer with 176 units.
We didn’t augment the data using vocal tract length perturbation.
Network architecture
We have tested several network architectures. The first set consists of plain AlexNet-like convolutional networks. The second set contains no convolutions and interprets the columns of the spectrogram as a sequence of inputs to a recurrent network. The third set applies an RNN on top of the features extracted by a convolutional network. All models are implemented in Theano and Lasagne.
Almost all networks easily reach 100% accuracy on the training set. In the following tables we describe all architectures we tried and report accuracy on the validation set.
Convolutional networks (CNN)
The network consists of 6 blocks of 2D convolution, ReLU nonlinearity, 2D max pooling and batch normalization. We use 7x7 filters for the first convolutional layer, 5x5 for the second and 3x3 for the rest. The pooling size is always 3x3 with stride 2.
Batch normalization significantly increases training speed (as reported in many recent papers). Finally, we use only one fully connected layer between the last pooling layer and the softmax layer, and apply 50% dropout to it.
Network | Accuracy | Notes |
---|---|---|
tc_net | <80% | The difference between this network and the CNN described in the previous work is that this network has only one fully connected layer. We didn’t train this network much because of ignore_border=False, which slows down training |
tc_net_mod | 97.14 | Same as tc_net, but with pad=2 instead of ignore_border=False |
tc_net_mod_5khz_small | 96.49 | A smaller copy of tc_net_mod that uses frequencies only up to 5.5 kHz |
The Lasagne setting ignore_border=False prevents Theano from using cuDNN. Setting it to True significantly increased the training speed.
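The difference between the two variants amounts to a single argument of the pooling layer; a minimal sketch (the input layer here is just a placeholder):

```python
from lasagne.layers import InputLayer, MaxPool2DLayer

incoming = InputLayer((None, 16, 852, 250))  # e.g. the output of the first conv block

# tc_net: keep partial pooling windows at the border; this falls back to
# Theano's own pooling implementation and is noticeably slower.
pool_slow = MaxPool2DLayer(incoming, pool_size=3, stride=2, ignore_border=False)

# tc_net_mod: zero-pad instead, leaving ignore_border at its default (True),
# so Theano can use the cuDNN pooling kernel.
pool_fast = MaxPool2DLayer(incoming, pool_size=3, stride=2, pad=2)
```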
Here is the detailed description of the best network of this set: tc_net_mod.
| Nr | Type | Channels / Units | Width | Height | Kernel size / stride |
|---|---|---|---|---|---|
| 0 | Input | 1 | 858 | 256 | |
| 1 | Conv | 16 | 852 | 250 | 7x7 / 1 |
| | ReLU | 16 | 852 | 250 | |
| | MaxPool | 16 | 427 | 126 | 3x3 / 2, pad=2 |
| | BatchNorm | 16 | 427 | 126 | |
| 2 | Conv | 32 | 423 | 122 | 5x5 / 1 |
| | ReLU | 32 | 423 | 122 | |
| | MaxPool | 32 | 213 | 62 | 3x3 / 2, pad=2 |
| | BatchNorm | 32 | 213 | 62 | |
| 3 | Conv | 64 | 211 | 60 | 3x3 / 1 |
| | ReLU | 64 | 211 | 60 | |
| | MaxPool | 64 | 107 | 31 | 3x3 / 2, pad=2 |
| | BatchNorm | 64 | 107 | 31 | |
| 4 | Conv | 128 | 105 | 29 | 3x3 / 1 |
| | ReLU | 128 | 105 | 29 | |
| | MaxPool | 128 | 54 | 16 | 3x3 / 2, pad=2 |
| | BatchNorm | 128 | 54 | 16 | |
| 5 | Conv | 128 | 52 | 14 | 3x3 / 1 |
| | ReLU | 128 | 52 | 14 | |
| | MaxPool | 128 | 27 | 8 | 3x3 / 2, pad=2 |
| | BatchNorm | 128 | 27 | 8 | |
| 6 | Conv | 256 | 25 | 6 | 3x3 / 1 |
| | ReLU | 256 | 25 | 6 | |
| | MaxPool | 256 | 14 | 3 | 3x3 / 2, pad=2 |
| | BatchNorm | 256 | 14 | 3 | |
| 7 | Fully connected | 1024 | | | |
| | ReLU | 1024 | | | |
| | BatchNorm | 1024 | | | |
| | Dropout | 1024 | | | |
| 8 | Fully connected | 176 | | | |
| | Softmax loss | 176 | | | |
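The table above maps almost directly onto Lasagne layers. Below is a rough sketch of tc_net_mod following that description; it is not necessarily identical to the released code.

```python
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            BatchNormLayer, DenseLayer, DropoutLayer)
from lasagne.nonlinearities import rectify, softmax

def build_tc_net_mod(input_var=None):
    # Input: one-channel spectrogram, 256 frequency bins x 858 time frames.
    net = InputLayer((None, 1, 256, 858), input_var)
    num_filters = (16, 32, 64, 128, 128, 256)
    filter_sizes = (7, 5, 3, 3, 3, 3)
    for n, f in zip(num_filters, filter_sizes):
        # Conv + ReLU, then 3x3 max pooling with stride 2 and pad=2,
        # followed by batch normalization.
        net = Conv2DLayer(net, num_filters=n, filter_size=f,
                          nonlinearity=rectify)
        net = BatchNormLayer(MaxPool2DLayer(net, pool_size=3, stride=2, pad=2))
    # One fully connected layer with batch norm and 50% dropout,
    # then the 176-way softmax.
    net = DenseLayer(net, num_units=1024, nonlinearity=rectify)
    net = DropoutLayer(BatchNormLayer(net), p=0.5)
    return DenseLayer(net, num_units=176, nonlinearity=softmax)
```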
During training we accidentally discovered a bug in Theano, which was quickly fixed by the Theano developers.
Recurrent neural networks (RNN)
The spectrogram can be viewed as a sequence of column vectors of size 256 (or 128, if only frequencies below 5.5 kHz are used). We apply recurrent networks with 500 GRU cells in each layer to these sequences.
Network | Accuracy | Notes |
---|---|---|
rnn | 93.27 | One GRU layer on top of the input layer |
rnn_2layers | 95.66 | Two GRU layers on top of the input layer |
rnn_2layers_5khz | 98.42 | Two GRU layers on top of the input layer, maximum frequency: 5.5 kHz |
The second layer of GRU cells improved the performance. Cropping out frequencies above 5.5 kHz helped fight overfitting. We didn’t use dropout for the RNNs.
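For reference, here is a rough Lasagne sketch of the two-layer, 5.5 kHz variant. How the sequence of GRU outputs is reduced to a single vector before the softmax (here, by taking the last hidden state) is an illustrative choice, not a detail taken from the table.

```python
from lasagne.layers import InputLayer, GRULayer, DenseLayer
from lasagne.nonlinearities import softmax

def build_rnn_2layers_5khz(input_var=None):
    # Each spectrogram is a sequence of 858 column vectors with
    # 128 frequency bins (the 5.5 kHz version).
    net = InputLayer((None, 858, 128), input_var)
    net = GRULayer(net, num_units=500)                           # full output sequence
    net = GRULayer(net, num_units=500, only_return_final=True)   # last hidden state
    return DenseLayer(net, num_units=176, nonlinearity=softmax)
```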
Both the RNNs and the CNNs were trained with adadelta for a few epochs, then with SGD with momentum (learning rate 0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, convergence is very slow. Adadelta converges faster but usually doesn’t reach as high a validation accuracy.
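A sketch of this two-phase schedule in Lasagne/Theano; the variable names and loss setup here are illustrative, not copied from the released code.

```python
import theano
import theano.tensor as T
import lasagne

def compile_train_fn(network, input_var, target_var, update_fn):
    # Categorical cross-entropy loss and a training function for the
    # given update rule.
    prediction = lasagne.layers.get_output(network)
    loss = lasagne.objectives.categorical_crossentropy(prediction,
                                                       target_var).mean()
    params = lasagne.layers.get_all_params(network, trainable=True)
    return theano.function([input_var, target_var], loss,
                           updates=update_fn(loss, params))

# input_var = T.tensor4('inputs'); target_var = T.ivector('targets')
# Phase 1: adadelta for a few epochs.
# train_fn = compile_train_fn(net, input_var, target_var,
#                             lasagne.updates.adadelta)
# Phase 2: SGD with momentum and a small learning rate until overfitting.
# train_fn = compile_train_fn(
#     net, input_var, target_var,
#     lambda loss, params: lasagne.updates.momentum(loss, params,
#                                                   learning_rate=0.003))
```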
Combinations of CNN and RNN
The general architecture of these combinations is a convolutional feature extractor applied to the input, followed by a recurrent network on top of the CNN’s output, an optional fully connected layer on the RNN’s output, and finally a softmax layer.
The output of the CNN is a set of several channels (also known as feature maps). We can have a separate GRU acting on each channel (with or without weight sharing), as described in this picture:
Another option is to interpret the CNN’s output as a 3D tensor and run a single GRU on 2D slices of that tensor:
The latter option has more parameters, but the information from different channels is mixed inside the GRU, and it seems to improve performance. This architecture is similar to the one described in this paper on speech recognition, except that they also use some residual connections (“shortcuts”) from input to RNN and from CNN to fully connected layers. It is interesting to note that recently it was shown that similar architectures work well for text classification.
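For illustration, here is a rough Lasagne sketch of this single-GRU variant (tc_net_rnn_onernn in the table below), assuming the 5.5 kHz (128-bin) input. The per-block filter counts, filter sizes and the number of GRU units are illustrative choices; only the final 32 channels of size 54x8 come from the table.

```python
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            BatchNormLayer, DimshuffleLayer, ReshapeLayer,
                            GRULayer, DenseLayer)
from lasagne.nonlinearities import rectify, softmax

def build_cnn_onernn(input_var=None):
    # 5.5 kHz input: 128 frequency bins x 858 time frames.
    net = InputLayer((None, 1, 128, 858), input_var)
    for n, f in zip((16, 32, 32, 32), (7, 5, 3, 3)):   # 4 conv blocks
        net = Conv2DLayer(net, num_filters=n, filter_size=f,
                          nonlinearity=rectify)
        net = BatchNormLayer(MaxPool2DLayer(net, pool_size=3, stride=2, pad=2))
    # CNN output: (batch, 32, 8, 54). Move time to the sequence axis and
    # flatten channels x frequency into the per-step feature vector.
    net = DimshuffleLayer(net, (0, 3, 1, 2))           # (batch, 54, 32, 8)
    net = ReshapeLayer(net, ([0], [1], -1))            # (batch, 54, 32*8)
    net = GRULayer(net, num_units=500, only_return_final=True)
    return DenseLayer(net, num_units=176, nonlinearity=softmax)
```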
Network | Accuracy | Notes |
---|---|---|
tc_net_rnn | 92.4 | The CNN consists of 3 convolutional blocks and outputs 32 channels of size 104x13. Each channel is fed to a separate GRU as a sequence of 104 vectors of size 13. The outputs of the GRUs are combined and fed to a fully connected layer |
tc_net_rnn_nodense | 91.94 | Same as above, but without the fully connected layer on top of the GRUs; their outputs are fed directly to the softmax layer |
tc_net_rnn_shared | 96.96 | Same as above, but the 32 GRUs share weights. This helped to fight overfitting |
tc_net_rnn_shared_pad | 98.11 | 4 convolutional blocks in the CNN, using pad=2 instead of ignore_border=False (which enables cuDNN and makes training much faster). The CNN outputs 32 channels of size 54x8. 32 GRUs with shared weights are applied (one per channel) and there is no fully connected layer |
tc_net_deeprnn_shared_pad | 95.67 | 4 convolutional blocks as above, but 2-layer GRUs with shared weights are applied to the CNN’s outputs. The second layer made overfitting stronger |
tc_net_shared_pad_augm | 98.68 | Same as tc_net_rnn_shared_pad, but the network randomly crops the input and takes a 9-second interval. This slightly improved performance |
tc_net_rnn_onernn | 99.2 | The outputs of a CNN with 4 convolutional blocks are grouped into a 32x54x8 3D tensor and a single GRU runs over a sequence of 54 vectors of size 32*8 |
tc_net_rnn_onernn_notimepool | 99.24 | Same as above, but the stride along the time axis is set to 1 in every pooling layer. Because of this, the CNN outputs 32 channels of size 852x8 |
A second GRU layer didn’t help in this setup, again because of overfitting.
It seems that subsampling along the time dimension is not a good idea: the RNN can make better use of the information that would otherwise be lost during subsampling. In the paper on text classification by Yijun Xiao and Kyunghyun Cho, the authors even suggest that perhaps all pooling/subsampling layers can be replaced by recurrent layers. We didn’t experiment with this idea, but it looks very promising.
These networks were trained using SGD with momentum only. The learning rate was set to 0.003 for around 10 epochs, then it was manually decreased to 0.001 and then to 0.0003. On average, it took 35 epochs to train these networks.
Ensembling
The best single model had 99.24% accuracy on the validation set. We collected 33 sets of predictions from these models (some models contributed more than one set, taken after different epochs), summed the predicted probabilities, and got 99.67% accuracy. Surprisingly, our other ensembling attempts (e.g. majority voting, ensembling only a subset of the models) didn’t give better results.
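In code, this ensembling step is essentially a one-liner over the saved prediction matrices (the array layout below is an assumption):

```python
import numpy as np

def ensemble_predictions(probability_sets):
    """Sum per-model softmax outputs and take the argmax per recording.

    probability_sets: a list of arrays of shape (num_recordings, 176),
    one per model/checkpoint (33 in our case). Summing and averaging
    give the same argmax."""
    summed = np.sum(probability_sets, axis=0)
    return np.argmax(summed, axis=1)
```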
Final remarks
The number of hyperparameters in these CNN+RNN mixtures is huge. Because of limited hardware we explored only a very small fraction of the possible configurations.
The organizers of the original contest did not publicly release the dataset. Nevertheless, we have released the full source code on GitHub. We couldn’t find many Theano/Lasagne implementations of CNN+RNN networks on GitHub, and we hope these scripts will partially fill that gap.
This work was part of Hrayr’s bachelor’s thesis, which is available on academia.edu (the text is in Armenian).