Ten years after AlexNet ushered in a computer vision revolution, Albert van Breemen marvels at the progress that has been made since.
Ten years ago, the AlexNet algorithm premiered at the ImageNet computer vision challenge, forever changing the field of artificial intelligence. Before AlexNet, traditional computer vision algorithms achieved an error rate of 25 percent on the ImageNet challenge; AlexNet brought that down to 16 percent. It was a huge step forward for the computer vision research community.
AlexNet was developed by Alex Krizhevsky, a computer vision scientist at the University of Toronto, and his team. Rather than just an algorithm, it's an approach to developing algorithms that we now call deep learning. The team used artificial neural networks to build their algorithm, but in contrast to what was common at the time, their network had millions of parameters.
This immediately introduced two problems. First, you need a huge dataset to train the algorithm. Second, you need a lot of compute power. The first problem was solved by ImageNet, a database of over 14 million images. The second was tackled by following a trend that was becoming popular around that time: general-purpose computing on graphics processing units, or GPGPU, which uses GPUs to speed up computations. The combination of very large artificial neural networks, big datasets and GPUs is what we now call deep learning. It's what has made so many new things possible ten years later.
AlexNet was the spark that started the recent AI fire. Since its debut, many new algorithms have been published, each outperforming the last. In image classification and object detection alone, we've seen algorithms such as VGG16, VGG19, Inception, ResNet, Mask R-CNN, YOLO, SSD, U-Net and many more.
The techniques behind AlexNet have also been applied beyond image problems. By combining deep learning with another AI technique called reinforcement learning, the company DeepMind created the AlphaGo algorithm that defeated the world Go champion in 2016. AlphaGo, too, has been improved over the years: newer versions called AlphaZero and MuZero not only play Go but also chess, shogi and Atari computer games at world-champion level.
Most successful deep learning algorithms are based on what we call the supervised learning paradigm. With supervised learning, every data point in your dataset needs an annotation to train the algorithm. During the learning process, the annotation teaches the algorithm the correct classification or the object in the image. Annotations are created manually by humans, and this is currently a weakness of deep learning algorithms.
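To make the paradigm concrete, here's a minimal toy sketch in Python (NumPy assumed). The 2D feature vectors, the "cats" and "dogs" labels and the nearest-centroid classifier are all illustrative stand-ins, not anything AlexNet actually uses; the point is only that the model learns exclusively from human-provided annotations.

```python
import numpy as np

# Toy supervised dataset: every data point comes with a human-provided
# annotation (label). Feature vectors stand in for images here.
rng = np.random.default_rng(0)
cats = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # annotated as class 0
dogs = rng.normal(loc=2.0, scale=0.5, size=(50, 2))  # annotated as class 1
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)  # the annotations

# A minimal "model": assign each point to the nearest class centroid,
# where the centroids are learned from the annotated examples.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(points):
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

accuracy = (predict(X) == y).mean()
```

Without the annotation array `y`, neither the training step nor the accuracy check would be possible; that dependency is exactly what makes manual annotation so expensive at scale.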
Data is the bottleneck in many industrial applications, as a lot of annotated data is needed to train highly accurate algorithms. In practice, collecting and annotating one image costs between 1 and 10 euros, depending on the computer vision task. To achieve high accuracy, datasets of thousands or tens of thousands of images are needed. Not only is this very costly, collecting and annotating data is also very time consuming.
To overcome this data bottleneck, researchers have started to develop new learning techniques such as semi-supervised learning, active learning, self-supervised learning and unsupervised learning. The ambition behind all these techniques is to train a deep learning algorithm with very little or even no annotated data. So far, the results haven't been good enough to do away with costly manual annotation.
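One of these techniques, semi-supervised learning, can be sketched in a few lines. This toy example (NumPy assumed, with the same illustrative nearest-centroid stand-in for a real model) uses pseudo-labelling: train on a handful of annotated points, let the model annotate the rest itself, then retrain on everything.

```python
import numpy as np

# Semi-supervised toy setup: 200 points, but only 6 carry human annotations.
rng = np.random.default_rng(1)
class_a = rng.normal(0.0, 0.5, size=(100, 2))
class_b = rng.normal(2.0, 0.5, size=(100, 2))
X = np.vstack([class_a, class_b])
true_y = np.array([0] * 100 + [1] * 100)

labelled_idx = np.array([0, 1, 2, 100, 101, 102])  # 3 annotations per class
X_lab, y_lab = X[labelled_idx], true_y[labelled_idx]
X_unlab = np.delete(X, labelled_idx, axis=0)       # the other 194 points

def fit_centroids(points, labels):
    return np.array([points[labels == c].mean(axis=0) for c in (0, 1)])

def predict(points, centroids):
    d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    return d.argmin(axis=1)

# Step 1: train on the few annotated points only.
centroids = fit_centroids(X_lab, y_lab)
# Step 2: pseudo-label the unlabelled data with that initial model.
pseudo = predict(X_unlab, centroids)
# Step 3: retrain on annotated plus pseudo-labelled data combined.
centroids = fit_centroids(np.vstack([X_lab, X_unlab]),
                          np.concatenate([y_lab, pseudo]))

accuracy = (predict(X, centroids) == true_y).mean()
```

On well-separated toy clusters like these, six annotations go a long way; real-world data is far messier, which is why such techniques haven't yet eliminated manual annotation.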
Another approach to reducing data collection and annotation costs is to have the data generated by another algorithm. With synthetic data, you first make a 3D model of the object you're interested in and then use software to generate automatically annotated image datasets. The difficulty with this approach is the sim-to-real gap: the performance of the algorithm in the real world depends on how well the images are rendered and how realistic they look.
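The key idea, that the annotation comes for free because the generator knows what it drew, can be shown with a deliberately crude sketch (NumPy assumed). A real pipeline would use a 3D renderer; this toy draws flat squares on a 64x64 canvas and records their bounding boxes as automatic annotations.

```python
import numpy as np

# Synthetic-data sketch: instead of photographing and hand-annotating
# objects, "render" them programmatically, so the bounding-box annotation
# is produced automatically alongside each image.
rng = np.random.default_rng(2)

def render_sample():
    img = np.zeros((64, 64), dtype=np.float32)
    size = int(rng.integers(8, 17))       # object size in pixels
    x = int(rng.integers(0, 64 - size))   # top-left corner, kept in-bounds
    y = int(rng.integers(0, 64 - size))
    img[y:y + size, x:x + size] = 1.0     # "render" the object
    bbox = (x, y, size, size)             # the automatic annotation
    return img, bbox

# 100 perfectly annotated training samples, at zero annotation cost.
dataset = [render_sample() for _ in range(100)]
```

The catch the paragraph above describes is the sim-to-real gap: a model trained on these clean squares would fail on real photographs, and the same problem, in subtler form, affects any insufficiently realistic renderer.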
More recently, great successes have been achieved in generative modeling. Generative models are algorithms that generate data. Up to last year, generative adversarial networks, or GANs, were the state of the art for generating images of, for instance, non-existent people. This year, algorithms based on diffusion models have been released, such as DALL-E, Imagen and Stable Diffusion. These models can translate a textual description into a realistic picture. As the quality of these images is very high, they might be used to generate training sets for supervised learning algorithms and so reduce data costs.
So now, only ten years after AlexNet, we're entering the era where one algorithm generates the data for another to learn from. To speak with the famous words of Two Minute Papers YouTuber Károly Zsolnai-Fehér: what a time to be alive!