
The Real History of AI, Part 4: The Deep-Learning Big Bang (2012–2017)

On September 30, 2012, deep learning stopped being an academic niche: AlexNet won ImageNet by a margin the contest had never seen. Between that day and the December 2017 paper 'Attention Is All You Need' lie five years that contain almost all of modern AI's architectural ideas - from word2vec to GANs to AlphaGo.

Mikhail Savchenko · May 1, 2026 · 7 min read
AI · History · Deep Learning · AlexNet · Transformers

From 2012 to 2017, AI went through its biggest technical explosion in half a century: AlexNet (2012) triggered the neural-architecture race, word2vec (2013) gave words numerical meaning, GANs (2014) taught networks to generate images, AlphaGo (2016) beat the world Go champion, and in December 2017 a paper titled 'Attention Is All You Need' described the transformer - the architecture ChatGPT would run on five years later.

Key facts

  • 2012: AlexNet won ImageNet with 15.3% top-5 error against 26.2% for the runner-up - a gap the contest had never seen.
  • 2014: Ian Goodfellow published the Generative Adversarial Network (GAN) paper - the technology behind most image generation until diffusion models arrived.
  • 2015: Microsoft Research's ResNet surpassed human-level performance on ImageNet (3.57% top-5 error vs ~5% for humans).
  • 2016: DeepMind's AlphaGo beat Lee Sedol 4-1 at Go - a game considered out of reach of AI for at least another decade.
  • 2017: 'Attention Is All You Need' (Vaswani et al., Google) introduced the Transformer architecture - the foundation of every subsequent LLM, including GPT, Claude, and Gemini.

The Date Everything Changed

September 30, 2012 is a date worth remembering. The 2012 ImageNet results were published. The convolutional network AlexNet, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won with 15.3% top-5 error against 26.2% for the runner-up.

A 10.9 percentage-point gap, in a contest where annual improvements had been measured in tenths of a percent, was an event of an entirely different order. Within months, computer-vision researchers were migrating en masse from SVMs to neural networks. Within two years, data scientists at every serious startup were retraining themselves on deep learning. The big bang had begun.

This is part four of the AI history series - the five years that contain almost every architectural idea in modern AI.

2012: AlexNet, Five Days, Two GPUs

What was inside AlexNet that made it so powerful? Technically, three engineering decisions:

  1. Depth: 8 layers (5 convolutional + 3 fully connected) - three times deeper than LeCun's 1989 LeNet.
  2. GPU training: the entire network trained on two consumer NVIDIA GTX 580 GPUs (gaming graphics cards) for about five days - one of the first high-profile uses of gaming hardware for a large ML task.
  3. Training tricks: ReLU activations instead of saturating sigmoids (roughly six times faster convergence in the paper's experiments), dropout (randomly disabling neurons each step to prevent overfitting), and data augmentation (cropping, flipping, color shifts).

None of these ideas was new on its own. ReLU had been discussed since the 2000s. Dropout was Hinton's 2012 idea. CNNs existed since 1989. What was new was the combination plus GPUs plus ImageNet. The 2012 magic was engineering, not mathematics.
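
For a feel of that recipe, here is a minimal sketch (PyTorch and torchvision assumed; AlexNet's original code was a custom GPU implementation, so this is an illustration, not the 2012 network) of the three ingredients together - ReLU, dropout, and crop/flip/color augmentation:

```python
# A toy AlexNet-flavored setup illustrating the 2012 recipe:
# ReLU activations, dropout, and crop/flip/color augmentation.
import torch.nn as nn
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crops
    transforms.RandomHorizontalFlip(),       # mirror images
    transforms.ColorJitter(0.2, 0.2, 0.2),   # mild color shifts
    transforms.ToTensor(),
])

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),                    # ReLU instead of saturating sigmoids
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(0.5),              # randomly zero activations during training
    nn.Linear(192, 1000),         # 1000 ImageNet classes
)
```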

2013: word2vec - Meaning From Statistics

In 2013 Tomáš Mikolov and colleagues at Google published word2vec - a technique for converting words into dense numerical vectors. The idea was startlingly simple: train a shallow neural network to predict neighboring words in text. The internal representations (embeddings) it produced had remarkable properties:

  • vector('king') − vector('man') + vector('woman') ≈ vector('queen')
  • vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')

A neural network that had never explicitly been taught semantics had absorbed something like meaning from raw word co-occurrence statistics. This idea - meaning is distribution across contexts - became the foundation of every later language model. GPT, BERT, Claude - they all run on embeddings whose pedigree traces back to word2vec.
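
The analogy arithmetic is easy to try yourself. A minimal sketch with the gensim library (assumed installed), using the classic Google News vectors, which have to be downloaded separately:

```python
# Reproducing the famous word2vec analogy with gensim.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True  # local path, obtained separately
)

# king - man + woman ≈ ?
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically something like: [('queen', 0.71)]
```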

2014: GANs - Networks That Play Each Other

In 2014, PhD student Ian Goodfellow proposed Generative Adversarial Networks (GANs). By his own account, the idea hit him during a late-night discussion in a Montreal bar, and he had it working in code that same evening.

GAN architecture: two networks play a game. The generator takes random noise as input and tries to produce a plausible output (like a face). The discriminator receives either real data or fakes from the generator and tries to tell them apart. Both train at once: the generator learns to fool the discriminator; the discriminator learns to resist being fooled. Over time the generator produces ever more realistic samples.
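
As a sketch of that game - not Goodfellow's code - here is a toy adversarial loop on 2-D points (PyTorch assumed; real GANs use convolutional nets and images):

```python
# A minimal adversarial training loop on toy 2-D data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(64, 2) + torch.tensor([3.0, 3.0])  # toy "real" distribution

for step in range(1000):
    # 1) Train the discriminator: label real as 1, fakes as 0
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real_data), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator: make D label its fakes as real
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```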

By 2018, NVIDIA's StyleGAN was generating photorealistic faces of people who do not exist (see thispersondoesnotexist.com), and deepfake videos were already circulating. Until diffusion models took over in 2020-2022, GANs were the dominant generative-AI technology.

2015: ResNet - A Network Deeper Than the Brain

In December 2015 a Microsoft Research team (Kaiming He et al.) published ResNet - a network with 152 layers. The key trick was residual connections: shortcut paths that let gradients flow through the stack of layers without vanishing.
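
A minimal sketch of a residual block (PyTorch assumed; simplified relative to the paper's bottleneck blocks). The layer learns a correction F(x) and adds it to its input, so the gradient always has an unimpeded identity path:

```python
# A basic residual block: output = ReLU(x + F(x)).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))  # identity shortcut + learned residual

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```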

Result: 3.57% top-5 error on ImageNet, compared with roughly 5% for humans on the same task. It was the first time a neural network surpassed human performance on a major computer-vision benchmark.

By 2016 ResNet was the default backbone for every CV task - object detection, segmentation, face recognition. Residual connections then crossed into the 2017 transformer and into LLMs.

2016: AlphaGo and the Game Machines "Could Not" Win

Go was long considered out of reach for AI. Chess has about 30 reasonable moves per position on average; Go has about 200. The Go board admits more than 10^170 legal positions - more than the number of atoms in the observable universe. The brute-force search that worked in chess did not scale to Go with any 1990s-2000s technique.

In March 2016 AlphaGo from DeepMind (a Google company) beat Korean professional Lee Sedol 4-1 in a five-game match. Inside were three ingredients:

  • A convolutional network evaluating the position (value network).
  • A convolutional network proposing the next move (policy network).
  • Monte Carlo Tree Search (MCTS) guided by both networks.
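
AlphaGo's own code is not public, but the PUCT-style selection rule at the heart of such a search is described in the paper. A schematic sketch, with simplified node bookkeeping of my own invention:

```python
# Schematic PUCT selection: balance the mean value Q (exploitation)
# against a bonus U driven by the policy prior and visit counts.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # policy network's probability for this move
    visit_count: int = 0
    value_sum: float = 0.0       # accumulated value-network evaluations
    children: dict = field(default_factory=dict)  # move -> Node

    def q(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Pick the child maximizing Q + U."""
    total = sum(ch.visit_count for ch in node.children.values())
    def score(ch: Node) -> float:
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visit_count)
        return ch.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))
```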

The networks were trained first on human games, then by playing against themselves millions of times. In game two, AlphaGo played move 37, which commentators called "a move no human would make" - the famous moment when it sank in that the machine plays differently than we do. A year later, AlphaGo Zero learned Go from scratch, without a single human game, and beat the original AlphaGo 100-0.

A Personal Anecdote: Watching AlphaGo Live

I remember the night of March 9, 2016 - the first Lee Sedol vs AlphaGo game. I was watching the stream (with translated commentary) at around two in the morning, sure that Sedol would win; so was nearly every expert. AlphaGo won game one. Then game two. By then the sporting outcome had stopped mattering; I was watching with the growing sense that something historic was happening live, in front of me.

A couple of days later I tried playing against AlphaGo-style engines online (DeepMind itself didn't open access, but similar open-weight models appeared quickly). I'm an amateur, maybe 12-kyu, but even at that level the difference was obvious: the machine did not play like a human. Not better or worse - differently. It played moves that Go literature had dismissed as odd or weak for centuries, and they worked.

I had never before watched AI discover rather than imitate. Six years later, programmers would get the same feeling watching Copilot write a non-trivial chunk of code; eight years later, scientists would get it using AlphaFold to predict protein structures. The through-line is the same.

December 2017: The Paper That Changed Everything

On June 12, 2017, eight researchers at Google Brain posted a preprint, and in December the paper appeared at NeurIPS. Title: "Attention Is All You Need." It described a new architecture for machine translation: the Transformer.

The transformer's core idea: drop the recurrent connections (RNN/LSTM) that forced training to crawl one step at a time. Instead, use the attention mechanism - each word in a sentence looks at every other word and decides how relevant it is to its own context. This gave two advantages:

  • Parallelism: a transformer trains on the whole sequence at once, not word by word.
  • Long-range dependencies: a word can directly "look at" any other word in the text, instead of relaying information through a chain of steps.
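
The core operation is small enough to write out. A minimal sketch of single-head scaled dot-product self-attention (PyTorch assumed; the real transformer adds multiple heads, output projections, and feed-forward layers):

```python
# Scaled dot-product self-attention: every token scores every other
# token, then takes a weighted mix of their values.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv                        # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # all-pairs relevance
    weights = F.softmax(scores, dim=-1)                     # attention distribution per token
    return weights @ v                                      # mix values by relevance

seq_len, d = 4, 8
x = torch.randn(seq_len, d)                    # 4 tokens, 8-dim embeddings
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)     # torch.Size([4, 8])
```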

The 2017 paper was about machine translation. At publication, none of its authors predicted that five years later this architecture would underpin ChatGPT, Claude, Gemini, and almost every major LLM in the world. Within a year, Google would release BERT and OpenAI would release GPT-1. The race was on.

What to Take From This Era

The main claims of Part 4:

  1. The 2012 big bang was an engineering event, not a mathematical one. AlexNet combined ideas from the 1980s-2000s (CNN, ReLU, dropout) with 2010s GPUs and a 2009 dataset. When people say "deep learning was invented in 2012," they mean the explosion point, not the invention point.
  2. Every architecture today's AI runs on was invented inside this five-year window. AlexNet (CNN, 2012), word2vec (embeddings, 2013), GAN (generative networks, 2014), seq2seq + attention (2014-2015), ResNet (2015), Transformer (2017). Everything we now call "AI" is a variation on these architectures.
  3. AlphaGo proved AI can discover. Before 2016 the assumption was "AI can only repeat what's in the data." AlphaGo, playing itself and finding moves nobody had played in 4,000 years of Go history, buried that assumption.
  4. The 2017 transformer is a rare case of an architecture that worked immediately and stayed dominant. Over nine years (2017-2026), dozens of alternatives have been proposed (Mamba, RWKV, S4, RetNet, and so on). Every major model in the world as of 2026 is still a transformer. This is the longest architectural consensus in ML history.

In Part 5: the last five years - BERT and GPT, the scaling to GPT-3, InstructGPT, ChatGPT, and my own story - how in 2019 I built a commercial news-copywriting product on GPT-2, three and a half years before the world "discovered AI."

Frequently Asked Questions

What made AlexNet different from previous ImageNet contestants?

Three things. First, it was a deep (8-layer) convolutional network rather than an SVM with hand-engineered features. Second, it trained on two consumer NVIDIA GTX 580 GPUs - an early, high-profile use of gaming hardware for a large ML task. Third, it used ReLU (instead of sigmoids), dropout (against overfitting), and data augmentation - three engineering tricks that became the new standard.

What is word2vec and why does it matter?

word2vec, introduced by Tomáš Mikolov at Google in 2013, is a technique that maps words into dense numerical vectors (often 300-dimensional) where geometric operations carry semantic meaning: vector('king') - vector('man') + vector('woman') ≈ vector('queen'). It was the first mass example of a neural network learning something like word meaning from pure co-occurrence statistics. Every subsequent NLP system rests on this idea.

What is a GAN and where is it used?

A Generative Adversarial Network, proposed by Ian Goodfellow in 2014, pits two networks against each other. The generator tries to produce plausible data (faces, say); the discriminator tries to tell real from fake. They train together, and the quality of generated samples climbs. GANs powered StyleGAN (photorealistic faces), CycleGAN (unpaired image-to-image translation), early deepfake video, and most generative AI until diffusion models took over in 2020-2022.

Why was AlphaGo such a big deal?

Before AlphaGo, computers could not beat a Go professional - the game has roughly 10^170 possible positions (more than atoms in the observable universe). Methods that worked in chess (minimax + alpha-beta) did not scale to Go. AlphaGo combined deep learning (two networks - value and policy) with Monte Carlo Tree Search and self-play. In March 2016 it beat Lee Sedol 4-1, an outcome considered out of reach for at least another decade.

If the Transformer was published in 2017, why did ChatGPT only ship in 2022?

Five years of engineering between paper and product. 2018 brought BERT (Google) and GPT-1 (OpenAI). 2019-2020 brought GPT-2 and GPT-3, which showed that scale produced qualitatively new properties. In 2022 OpenAI added instruction tuning and RLHF on top of GPT-3.5 and wrapped the result in a chat interface - that was ChatGPT. The 2017 architecture itself didn't change radically. What changed was training scale and behavioral fine-tuning.
