

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, three researchers at the University of Toronto, achieved record-breaking performance on a prominent image classification challenge: the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), an annual competition in which contenders aim to classify around 150,000 images into 1,000 categories as accurately as possible.

ILSVRC performance is assessed by top-1 and top-5 error. Each contesting algorithm produces a ranked list of the categories it considers most likely for an image. Top-1 error is the percentage of images whose #1 answer is incorrect, while top-5 error is the percentage of images for which none of the top five answers is correct. In a way, top-5 error is more meaningful than top-1, since images may contain multiple subjects, so an officially incorrect label might still be “correct” if it was chosen due to peripheral content of an image. For example, an image of a bird eating a seed might be justifiably labeled as either “bird” or “seed,” even if the bird was the primary subject of the image.
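To make the two metrics concrete, here is a minimal sketch of how a top-k error could be computed. The function name `top_k_error`, the toy scores, and the four-class setup are all made up for illustration; the real challenge uses 1,000 classes and k = 1 or k = 5.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for class_scores, true_label in zip(scores, labels):
        # Rank class indices from highest score to lowest and keep the first k.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if true_label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Three toy "images" scored over four hypothetical classes.
scores = [
    [0.70, 0.20, 0.05, 0.05],  # true class 0 is ranked first
    [0.10, 0.50, 0.30, 0.10],  # true class 2 is ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked last
]
labels = [0, 2, 3]

top1 = top_k_error(scores, labels, k=1)  # two of three images are missed
top2 = top_k_error(scores, labels, k=2)  # only the third image is missed
```

Widening k can only lower the error, which is why a top-5 figure is always at or below the top-1 figure for the same model.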

With a record-low top-5 error of 15.3%, Krizhevsky et al. won ILSVRC-2012 using their deep convolutional neural network, commonly known as AlexNet, besting the runner-up by over 10 percentage points.


AlexNet is composed of 8 trainable layers: 5 convolutional layers followed by 3 dense (fully connected) layers. With about 60 million trainable parameters, AlexNet was complex enough to find sufficient patterns to achieve its record accuracy. However, models that are too large or complex can easily fall victim to overfitting. Krizhevsky et al. found that dropout and data augmentation were effective ways to reduce overfitting. They used a 50% dropout rate in the first two fully connected layers. For data augmentation, they trained on random 224×224 patches of each 256×256 image along with their horizontal reflections, and also varied the intensities of the RGB channels. At test time, the network averaged its predictions over 5 patches of each image (the four corners and the center) and their horizontal reflections.
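The five patches and their horizontal reflections can be sketched in a few lines. This is only an illustration on a tiny single-channel image stored as nested lists. The helper names (`crop`, `hflip`, `five_crops_with_flips`) are my own, and the choice of four corners plus the center follows the description in the published paper, which works with 224×224 patches of 256×256 RGB images.

```python
def crop(image, top, left, size):
    """Return the size x size patch whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror an image left to right."""
    return [row[::-1] for row in image]


def five_crops_with_flips(image, size):
    """Four corner patches, the center patch, and the mirror image of each."""
    height, width = len(image), len(image[0])
    offsets = [
        (0, 0),                                       # top-left
        (0, width - size),                            # top-right
        (height - size, 0),                           # bottom-left
        (height - size, width - size),                # bottom-right
        ((height - size) // 2, (width - size) // 2),  # center
    ]
    patches = [crop(image, top, left, size) for top, left in offsets]
    return patches + [hflip(p) for p in patches]


# A 5x5 toy image whose pixel values are simply 0..24 in reading order.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
patches = five_crops_with_flips(image, size=3)  # 10 patches of 3x3
```

Each image yields ten views in total, so a model's predictions on all ten can be averaged into one answer for that image.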

Being one of the largest CNNs at the time, AlexNet was challenging to train in a reasonable amount of time. By using two GPUs to train AlexNet, Krizhevsky et al. were able to significantly decrease training time. Even so, it still took about 5 1/2 days to complete training on their hardware.

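An additional characteristic that increased training speed was the use of the ReLU activation function. Unlike saturating functions such as the hyperbolic tangent, ReLU has a constant gradient for positive inputs, which makes it far less prone to the vanishing gradient problem. In the paper's comparison, a small four-layer CNN with ReLUs reached 25% training error on CIFAR-10 six times faster than the same network using the hyperbolic tangent.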
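A quick way to see why saturation matters is to compare the derivatives of the two activations as the input grows. The sketch below is my own illustration, not code from the paper, and the sample inputs are arbitrary.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x), which is 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


inputs = [0.5, 2.0, 5.0]
relu_grads = [relu_grad(x) for x in inputs]  # stays at 1.0 for every positive input
tanh_grads = [tanh_grad(x) for x in inputs]  # shrinks toward 0 as the input grows

for x, rg, tg in zip(inputs, relu_grads, tanh_grads):
    print(f"x={x}: relu'={rg}, tanh'={tg:.6f}")
```

When many layers are stacked, these per-layer derivatives get multiplied together during backpropagation, so small tanh derivatives compound into very small updates. ReLU avoids that for positive inputs, though units with negative inputs pass no gradient at all.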


To achieve their winning results, Krizhevsky et al. actually averaged the predictions of 7 AlexNet-like models, yielding the 15.3% top-5 error. But even a single CNN of this kind reached a top-5 error of 16.6% on the validation set. The second-best contestant managed a top-5 error of 26.2%, also by averaging the predictions of several classifiers. AlexNet proved successful beyond the 2012 competition as well, with strong results on ILSVRC-2010 and on the Fall 2009 release of ImageNet.
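Averaging predictions across models is simple to sketch. The example below assumes each model outputs a probability per class for the same image. The function name and the numbers are hypothetical and chosen only to show how the averaged answer can differ from one model's answer.

```python
def average_predictions(per_model_probs):
    """Average class probabilities element-wise across models."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(model[c] for model in per_model_probs) / n_models
        for c in range(n_classes)
    ]


# Hypothetical outputs of three models for one image over three classes.
per_model_probs = [
    [0.50, 0.40, 0.10],  # this model alone would pick class 0
    [0.30, 0.60, 0.10],
    [0.35, 0.55, 0.10],
]

averaged = average_predictions(per_model_probs)
predicted_class = max(range(len(averaged)), key=lambda c: averaged[c])
```

Here the first model's mistake is outvoted by the other two, which is the intuition behind why an ensemble tends to be a little more accurate than any single member.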


The number of convolutional layers contributed significantly to the success of AlexNet. Another design decision that made a subtle but noteworthy impact on the results was the use of overlapping pooling layers. Pooling layers downsample the feature maps, preserving only the strongest responses from the prior layer. Generally, pooling layers are designed with contiguous, non-overlapping pools, but Krizhevsky et al. used 3×3 pools with a stride of 2, so neighboring pools overlap, and found that this slightly reduced both the error rates and overfitting.
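The difference between contiguous and overlapping pools is easiest to see in one dimension. The sketch below is a simplified illustration of max pooling over a single row of values (the network itself pools over 2D feature maps). The function name and the sample row are my own. The window size of 3 with a stride of 2 matches the setting reported in the paper.

```python
def max_pool_1d(values, size, stride):
    """Slide a window of `size` over `values` in steps of `stride`, keeping each window's maximum."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


row = [1, 3, 7, 2, 4, 6, 5]

# Contiguous pools: window size equals the stride, so no value is seen twice.
non_overlapping = max_pool_1d(row, size=2, stride=2)  # windows [1,3] [7,2] [4,6]

# Overlapping pools: window size exceeds the stride, so windows share a value.
overlapping = max_pool_1d(row, size=3, stride=2)      # windows [1,3,7] [7,2,4] [4,6,5]
```

With overlap, the value 7 sits in two neighboring windows and shows up in both outputs, while the output length stays the same because the stride is unchanged.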

Personal Notes

One result I found curious is the way each GPU specialized the filters it learned, with one learning color-agnostic features and the other learning color-specific features. I’d be interested to explore how that phenomenon arose. I also find it remarkable that AlexNet performs so well with only 8 trainable layers, given that later ILSVRC winners grew to well over 100 layers.





Software Engineering Student at Holberton School | Interested in DSP and audio programming | Jazz pianist | Electronic music producer

Justin Masayda
