Journal Summary: ImageNet Classification with Deep Convolutional Neural Networks
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, three researchers at the University of Toronto, achieved record-breaking performance on a prominent image classification challenge: the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition in which entrants aim to classify images into 1,000 categories as accurately as possible. Entrants train on roughly 1.2 million labeled images and are evaluated on a test set of about 150,000 images.
ILSVRC performance is assessed by top-1 and top-5 error. Competing algorithms produce a ranked list of probabilities that an image belongs to each category; top-1 error is the percentage of images for which the highest-ranked category is incorrect, while top-5 error is the percentage for which none of the five highest-ranked categories is correct. In a way, top-5 error is more meaningful than top-1, since images may contain multiple subjects, so an officially incorrect label might still be “correct” if it was chosen due to peripheral content of an image. For example, an image of a bird eating a seed might be justifiably labeled either “bird” or “seed,” even if the bird was the primary subject of the image.
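The two metrics above reduce to one computation with different values of k. A minimal numpy sketch (my own illustration, not code from the paper; the function name is hypothetical):

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is absent from the k highest-scoring classes."""
    # Sort class indices by descending score, keep the first k per sample.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy example: 4 samples, 6 classes, random scores.
rng = np.random.default_rng(0)
scores = rng.random((4, 6))
labels = np.array([2, 0, 5, 1])
print(top_k_error(scores, labels, 1), top_k_error(scores, labels, 5))
```

For k = 1 this is ordinary classification error; increasing k can only keep the error the same or lower it, which is why top-5 numbers always look better than top-1.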
With a record-low top-5 error of 15.3%, Krizhevsky et al. won ILSVRC-2012 using their deep convolutional neural network, commonly known as AlexNet, besting the runner-up by over 10 percentage points.
AlexNet is composed of 8 trainable layers: 5 convolutional layers followed by 3 fully connected layers. With about 60 million trainable parameters, AlexNet was expressive enough to learn the patterns needed for its record accuracy. However, models this large and complex can easily fall victim to overfitting. Krizhevsky et al. found dropout and data augmentation to be efficient ways to combat it. They applied dropout with a 50% rate in the first two fully connected layers, and for data augmentation they trained on random 224×224 crops of each 256×256 image, along with the crops’ horizontal reflections; at test time they averaged predictions over five fixed crops (the four corners and the center) and their reflections.
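The two anti-overfitting tools can be sketched in a few lines of numpy. This is my own illustration, not the original pipeline: inverted dropout at a 50% rate, and training-time augmentation via a random 224×224 crop plus a coin-flip horizontal reflection.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def random_crop_flip(image, size=224):
    """Random `size`x`size` crop with a 50/50 horizontal reflection."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch

image = rng.random((256, 256, 3))
print(random_crop_flip(image).shape)  # (224, 224, 3)
print(dropout(np.ones((2, 4))))       # roughly half the entries zeroed, survivors scaled to 2.0
```

The rescaling by 1/(1 − rate) keeps the expected activation unchanged, so no adjustment is needed at test time; cropping a 256×256 image to 224×224 at random positions multiplies the effective training set size enormously at almost no storage cost.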
As one of the largest CNNs of its time, AlexNet was challenging to train in a reasonable amount of time. By parallelizing training across two GTX 580 GPUs, Krizhevsky et al. were able to reduce training time significantly. Even so, training took five to six days to complete on their hardware.
An additional choice that increased training speed was the ReLU activation function, which does not saturate for positive inputs and therefore largely avoids the vanishing gradient problem. In the authors’ experiments, a network using ReLUs reached a 25% training error rate about six times faster than the same network using the hyperbolic tangent.
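The saturation argument is easy to see numerically. A small sketch of my own, comparing the gradients of the two activations:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient of ReLU: exactly 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def tanh_grad(x):
    # Gradient of tanh: 1 - tanh(x)^2, which collapses toward 0 as |x| grows.
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu_grad(x))  # gradient stays at 1 for every positive input
print(tanh_grad(x))  # gradient is tiny for large |x| -- the saturation problem
```

With tanh, a unit whose pre-activation drifts to ±5 passes back a gradient of roughly 0.0002, so learning for that unit nearly stops; a ReLU unit in its active region always passes the gradient through at full strength.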
In order to achieve their winning results, Krizhevsky et al. actually trained 7 AlexNet-like models and averaged their predictions, yielding the 15.3% top-5 error; even with a single CNN, they reached 16.6%. The second-best contestant managed a top-5 error of 26.2% using a similar technique of averaging the results of several classifiers. AlexNet proved successful not only on the 2012 ImageNet dataset, but on the 2009–2011 releases as well.
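The ensemble trick is just averaging the models’ per-class probability vectors before ranking categories. A toy example of my own (hypothetical softmax outputs, not real model predictions):

```python
import numpy as np

preds = [
    np.array([0.6, 0.3, 0.1]),  # model 1 favors class 0
    np.array([0.2, 0.5, 0.3]),  # model 2 favors class 1
    np.array([0.3, 0.4, 0.3]),  # model 3 favors class 1
]
ensemble = np.mean(preds, axis=0)
print(ensemble, ensemble.argmax())
```

Because each independently trained model makes somewhat different mistakes, the averaged distribution tends to be better calibrated than any single member; here the ensemble picks class 1 even though the single most confident model voted for class 0.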
The number of convolutional layers contributed significantly to the success of AlexNet. Another design decision that made a subtle but noteworthy impact was the use of overlapping pooling layers. Pooling layers compress the model, preserving only the most significant information from the prior layers. Generally, pooling layers are designed with contiguous, non-overlapping pools, but Krizhevsky et al. observed that overlapping pools made the model slightly harder to overfit.
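Concretely, AlexNet pools over 3×3 windows with a stride of 2, so adjacent windows share a row or column. A small numpy sketch of my own contrasting the two schemes:

```python
import numpy as np

def max_pool(x, size, stride):
    """2-D max pooling over a single-channel feature map."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output is the max of one size x size window.
            out[i, j] = x[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(x, size=2, stride=2).shape)  # (3, 3): contiguous, non-overlapping pools
print(max_pool(x, size=3, stride=2).shape)  # (2, 2): overlapping pools, AlexNet-style
```

Since the stride (2) is smaller than the window (3), each window overlaps its neighbors by one row or column, so strong activations near a window edge can influence two outputs instead of one.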
One result I found curious is the way each GPU specialized in the filters it learned: one learned largely color-agnostic features, the other color-specific features. I’d be interested to explore how that phenomenon arose. I also find it remarkable that AlexNet performed so well with only 8 trainable layers, given that later winning networks grew dramatically deeper, eventually surpassing 100 layers.