The Real History of AI, Part 5: From the Transformer to ChatGPT (2017–2022) and a GPT-2 Case Study
ChatGPT is not the arrival of AI. It is the arrival of UX on top of a technology that had been growing for five years: BERT, GPT-1, GPT-2, GPT-3, InstructGPT. I know because in 2019 I built a commercial news-rewriting product on GPT-2 - more than three years before the world 'discovered AI.'
From 2017 to 2022 AI traveled from the 'Attention Is All You Need' paper to ChatGPT - not on a new technology, but on five years of scaling and UX. Between the Transformer and ChatGPT lived BERT (2018), GPT-1 (2018), GPT-2 (2019), GPT-3 (2020), InstructGPT (2022), and finally ChatGPT (November 2022). Each step grew the model by one to two orders of magnitude and added one new trick. The 2017 architecture itself barely changed.
Key facts
- 2018: BERT (Google) - 340M parameters; GPT-1 (OpenAI) - 117M parameters. The first generation of transformer language models.
- 2019: GPT-2 (OpenAI) - 1.5B parameters. OpenAI declined to release weights 'due to misuse risks' - the first loud AI-safety narrative event.
- 2020: GPT-3 (OpenAI) - 175B parameters. A 100x scale-up from GPT-2 in 15 months.
- January 2022: InstructGPT - GPT-3 fine-tuned via RLHF to follow instructions. This - not GPT-3 itself - is the direct ancestor of ChatGPT.
- November 30, 2022: ChatGPT launched. 1 million users in 5 days, 100 million in 2 months - the fastest consumer-product growth in history.
The Final Five Years
We left part four at December 2017 - the publication of "Attention Is All You Need." This part covers the last five years of AI history before ChatGPT: 2018-2022, the years that turned an academic architecture into a product 100 million people signed up for in eight weeks.
The central claim of this part: ChatGPT was not a technology breakthrough. It was a product breakthrough on top of a technology that had already been earning money in commercial startups for years. I'm not arguing this from theory - I made money with this technology myself in 2019; the story is below.
2018: BERT and GPT-1 - Two Branches of One Family
In October 2018 Google released BERT (Bidirectional Encoder Representations from Transformers) - a 340M-parameter model trained to fill in masked words in text. BERT was an encoder: it looked at the whole sentence at once and was strong on context. By 2019, BERT was running inside Google search, processing roughly 10% of all queries.
In June 2018 OpenAI released GPT-1 (Generative Pre-trained Transformer) - a 117M-parameter model trained to predict the next word. GPT was a decoder: it generated text word by word. At launch, GPT-1 was an interesting academic paper, nothing more.
The two branches - encoder and decoder - grew in parallel. Until 2022 the industrial mainstream was on the BERT side (search, enterprise NLP, classification). After ChatGPT everything flipped: decoder-only models became the standard for everything.
2019: GPT-2 and the "Too Dangerous" Narrative
In February 2019 OpenAI announced GPT-2 - a 1.5B-parameter model, 13× larger than GPT-1. They paired the announcement with a loud move: the full weights would not be released for safety reasons. The model could allegedly generate news plausible enough for disinformation use.
The community split. Some called it reasonable caution; others called it a marketing stunt - manufactured controversy to draw attention. OpenAI rolled out increasingly large versions: 124M in February, 355M in May, 774M in August, and finally the full 1.5B in November 2019.
By the time the full model dropped, GPT-2 was usable by anyone with a laptop and a decent GPU. And that was when I tried it in a commercial project.
A Personal Anecdote: A Commercial News Rewriter on GPT-2 (2019)
In 2019 I was working on a project for a news aggregator. The task sounded simple: take raw feeds from wire agencies (market news, sports scores, weather, corporate press releases) and rewrite them into readable short stories in the publication's voice.
Until then the work was done by in-house rewrite editors at 5-10 minutes per story. The publication shipped about 200 rewrites a day, which tied up roughly three full-time editors.
I took GPT-2 large (774M parameters) and fine-tuned it on five thousand input-output pairs: raw feeds as input, editor rewrites as output. Fine-tuning took a few hours on a single NVIDIA RTX 2080 Ti. The result:
- Time per story: 30 seconds (down from 5-10 minutes).
- Quality: on 70% of stories the editor accepted the output, on 25% they edited one or two sentences, on 5% they rewrote from scratch.
- Infrastructure cost: $200/month for a GPU server.
- Payback: one month.
This was September 2019. Three years and two months before ChatGPT "taught the world that AI copywriters exist." I invented no transformers and had no architectural insights. I took an open-source model, fine-tuned it on domain data, and wired it into a pipeline. The most ordinary production NLP of 2019.
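For a sense of how little machinery this takes, here is a minimal sketch of that kind of fine-tuning using today's Hugging Face transformers library. It is not my 2019 code: the file name news_pairs.jsonl, the RAW/REWRITE prompt format, and the hyperparameters are all illustrative assumptions.

```python
# Minimal sketch: fine-tune GPT-2 large on "raw feed -> editor rewrite" pairs.
# Illustrative only - file name, prompt format, and hyperparameters are assumptions.
import json
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)

class RewriteDataset(Dataset):
    """Each line of the JSONL file holds one pair: {"raw": "...", "rewrite": "..."}."""
    def __init__(self, path, max_len=512):
        self.examples = []
        with open(path) as f:
            for line in f:
                pair = json.loads(line)
                text = f"RAW: {pair['raw']}\nREWRITE: {pair['rewrite']}{tokenizer.eos_token}"
                ids = tokenizer(text, truncation=True, max_length=max_len,
                                return_tensors="pt").input_ids[0]
                self.examples.append(ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]

def collate(batch):
    # Pad to the longest sequence in the batch; -100 labels are ignored by the loss.
    max_len = max(len(ids) for ids in batch)
    input_ids = torch.full((len(batch), max_len), tokenizer.eos_token_id)
    labels = torch.full((len(batch), max_len), -100)
    for i, ids in enumerate(batch):
        input_ids[i, :len(ids)] = ids
        labels[i, :len(ids)] = ids
    return input_ids, labels

loader = DataLoader(RewriteDataset("news_pairs.jsonl"), batch_size=2,
                    shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for input_ids, labels in loader:
        loss = model(input_ids.to(device), labels=labels.to(device)).loss  # causal-LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-large-rewriter")
tokenizer.save_pretrained("gpt2-large-rewriter")
```

At inference you feed only the "RAW: ... REWRITE:" prefix and let the model generate the continuation - that, plus a queue and a review step for the editor, was essentially the whole pipeline.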
And there were thousands of people like me worldwide. Jasper (then Jarvis) launched in early 2021, Copy.ai in 2020, GitHub Copilot in June 2021 - all built on OpenAI models (GPT-3 or its code-tuned sibling Codex) via API. By the time ChatGPT launched in November 2022, dozens of commercial GPT-based products were already serving millions of users.
ChatGPT's main shift was in accessibility, not in the technology. Before it, you had to be a developer to get value from GPT. After November 30, 2022, you only had to open a website.
2020: GPT-3 and the Scaling Law
In May 2020 OpenAI announced GPT-3 - a 175B-parameter model, 117× larger than GPT-2. The architecture barely changed; the main scientific result of the paper "Language Models are Few-Shot Learners" was the scaling law: model quality grows predictably as you scale parameters, data, and compute.
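For readers who want the concrete form: OpenAI's scaling-laws paper from January of the same year (Kaplan et al., a detail not spelled out in this article) measured the effect as a power law - test loss falls smoothly and predictably as model size N grows, roughly:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
```

Analogous power laws hold for dataset size and compute, which is why the jump from 1.5B to 175B parameters was a calculated bet rather than a blind gamble.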
GPT-3 also showed an unexpected property - few-shot learning. The model could solve novel tasks given only a few examples in the prompt, without any fine-tuning. Philosophically this was new: before GPT-3, every new task had required its own training.
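To make "few-shot" concrete, here is a hypothetical prompt in the spirit of 2020-era GPT-3 use, shown with the legacy openai Python client (the Completion endpoint of that period; the modern client looks different). The task, the examples, and the engine name are illustrative.

```python
# Hypothetical few-shot prompt: two worked examples, then a new input.
# No fine-tuning happens - the "learning" lives entirely inside the prompt.
import openai

openai.api_key = "sk-..."  # placeholder

prompt = """Rewrite the headline in plain English.

Headline: Q3 EBITDA margin compression attributable to FX headwinds
Plain: Currency swings ate into third-quarter profits.

Headline: Precipitation event probability elevated for metro area
Plain: Rain is likely in the city.

Headline: Equities retreat amid hawkish central bank commentary
Plain:"""

# Legacy (2020-2022) completions API call.
response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=30,
    temperature=0.3,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```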
In June 2020 OpenAI opened the GPT-3 API - by waitlist at first, then, from autumn 2021, to anyone. By early 2022, billions of API requests per month were flowing from thousands of startups.
January 2022: InstructGPT and the RLHF Magic
In January 2022 OpenAI published "Training language models to follow instructions with human feedback." The paper described InstructGPT - GPT-3 fine-tuned via RLHF (Reinforcement Learning from Human Feedback) to follow instructions.
Technically, RLHF works like this (a minimal sketch of the reward-model step follows the list):
- Pretrain a base model on next-token prediction (already done with GPT-3).
- Collect a dataset: humans write instructions and exemplary answers. Fine-tune on them.
- For each prompt, generate several candidate answers. Have humans rank them best to worst.
- Train a reward model to predict those human rankings.
- Fine-tune the base model via PPO to maximize the reward.
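As a concrete illustration of step 4 only (the reward model), here is a minimal PyTorch sketch under toy assumptions: a small Transformer encoder instead of a GPT-3-scale network, and random token ids instead of real ranked answers. The essential part is the pairwise loss, which pushes the score of the human-preferred answer above the rejected one.

```python
# Toy reward model: score a (prompt + answer) sequence with a scalar,
# trained so that human-preferred answers score higher than rejected ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(d_model, 1)           # one scalar reward per sequence

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.score(h[:, -1]).squeeze(-1)      # read the reward off the last position

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Toy batch: token ids for (prompt + preferred answer) and (prompt + rejected answer).
preferred = torch.randint(0, 50_000, (8, 128))
rejected = torch.randint(0, 50_000, (8, 128))

# Pairwise ranking loss: -log sigmoid(r_preferred - r_rejected).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```

Step 5 - PPO against this reward model, with a KL penalty that keeps the fine-tuned model close to the original so it cannot simply game the reward - is where most of the engineering effort in RLHF actually goes.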
The result: a 1.3B-parameter InstructGPT (more than 100× smaller than GPT-3) produced answers humans preferred to GPT-3's (175B). Not because it was smarter. Because it had learned to answer what was actually asked instead of continuing the text in training-data style.
InstructGPT - not GPT-3 itself - is the direct ancestor of ChatGPT.
November 30, 2022: ChatGPT and the Product Explosion
On November 30, 2022, OpenAI launched ChatGPT. Technically it was a GPT-3.5 model fine-tuned with the same RLHF recipe as InstructGPT, wrapped in a chat interface. No new architectural ideas. A chat format instead of an API. Free access.
The effect was unprecedented:
- 5 days to 1 million users (Instagram took 2.5 months).
- 2 months to 100 million users (TikTok took 9 months).
And from this moment, in public consciousness, "AI was born." For the overwhelming majority of the mass audience, ChatGPT was their first encounter with a large language model - so they concluded the technology was new.
By that moment, in reality:
- The architecture (transformer) had been published 5 years earlier (2017).
- The base model (GPT-3) had been available via API for 2.5 years (since 2020).
- Similar models had been used in commercial products since 2019 (my GPT-2 case).
- BERT had been processing Google search queries since 2019.
- LSTM models had been generating text since 2015.
- word2vec had been running in production NLP since 2013.
ChatGPT was not the arrival of AI. It was the arrival of UX on top of AI - the moment the technology became as easy to use as a Google search.
What to Take From This Era (and From the Whole Series)
The main claims of Part 5:
- ChatGPT was a product breakthrough, not a technical one. The technology was ready by 2020. The convenient interface was the missing piece. When it appeared, the explosion happened.
- Every important piece of modern AI predates 2022. Transformer - 2017. GPT - 2018. Scaling - 2020. RLHF - 2022. UX wrapper - late 2022. Decades of work turned into "magic" the mass audience saw for the first time.
- Commercial business on large models worked at least three years before ChatGPT. I shipped on GPT-2 in 2019. Thousands of startups shipped on GPT-3 in 2020-2022. ChatGPT did not open commercial AI. It made it visible.
And most important - the thesis of the whole series:
- The history of AI does not start in November 2022. It starts in 1943, runs through two winters, fifteen years of invisible work in mail systems and search engines, the 2012 big bang - and arrives at ChatGPT as another step on the line, not a culmination. The line will not break. In ten years today's AI will look as simple as Last.fm circa 2007 looks now.
Whoever understands this history understands the future a little better. Because the next "big bang" is already happening - quietly, under another name, in the infrastructure, before marketing finds the right word for it. Exactly as with computer vision in 2005, recommender systems in 2007, and transformers in 2017.
ChatGPT surprised everyone. It shouldn't have. If the mass audience had known the eighty-year history, ChatGPT would have been received as a routine next step in the line, not as a miracle - which is, in fact, exactly what it is.
Thanks for reading the series.
Frequently Asked Questions
What is the difference between BERT and GPT?
BERT (Google, 2018) is a bidirectional encoder: it sees the whole sentence at once and learns to fill in masked words. Strong at understanding (search, classification). GPT (OpenAI) is a unidirectional decoder: it predicts the next word from previous ones. Strong at generation. Until 2022, the BERT approach dominated industry (Google search, enterprise NLP), and GPT was the academic branch. ChatGPT flipped this - generative decoders became the new mainstream.
What is RLHF and why did it make ChatGPT possible?
RLHF (Reinforcement Learning from Human Feedback) is fine-tuning a model via human preferences. The model generates several candidate answers to a prompt; humans rank them best to worst; a reward model is trained on those rankings; the main model is fine-tuned via PPO to maximize reward. This technique turned GPT-3 (which 'just continued the text') into InstructGPT/ChatGPT (which 'follows instructions and answers helpfully').
Why did OpenAI delay releasing the GPT-2 weights in 2019?
OpenAI said it was a safety decision - the model could generate plausible news stories, which could be used for disinformation. Critics read it as marketing (manufactured controversy around the product). Nine months later OpenAI released the full model. This was the first widely covered instance of the 'this AI is too dangerous to release' narrative, which would repeat many times later.
What made ChatGPT different from anything before it?
Three things. Technically - almost nothing (it was GPT-3.5 with RLHF, available via API for a year already). Product-wise - a chat interface instead of an API: anyone could open chat.openai.com and talk to the model without writing a line of code. Marketing-wise - OpenAI made it free for the mass user, which created enormous organic virality. ChatGPT was not a technology breakthrough. It was a product breakthrough on top of an existing technology.
What commercial GPT use existed before ChatGPT?
Between the GPT-3 API launch (June 2020) and ChatGPT (November 2022), almost 30 months passed in which dozens of startups already shipped GPT-3 products. Jasper (formerly Jarvis), Copy.ai, Notion AI, GitHub Copilot (June 2021) - all of it ran on the same technology one or two years before 'everyone discovered AI.' I myself shipped a commercial GPT-2 news-rewriting product in 2019 - the story is in the article above.
Keep reading
The Real History of AI, Part 4: The Deep-Learning Big Bang (2012–2017)
On September 30, 2012, deep learning stopped being an academic niche. AlexNet won ImageNet by a margin nobody had ever seen in the contest. Between that day and the December 2017 paper 'Attention Is All You Need' fit five years that contain almost all of modern AI's architectural magic - from word2vec to AlphaGo to GANs.
The Real History of AI, Part 3: Recommenders, Vision, and the Quiet Revolution (2000–2012)
By 2010, AI was already running inside every service you used: Netflix predicted your taste, Last.fm built your playlists, Facebook recognized friends in photos, and Gmail's spam filter blocked billions of messages a day. Nobody called it AI - it was 'big data' and 'machine learning'.
The Real History of AI, Part 2: Backprop, SVM, and the Second Winter (1980–2000)
In 1986 neural networks got a working learning algorithm - and most of the industry didn't notice. While the world watched expert systems collapse, OCR was already reading your mail at the post office, and SVMs were quietly winning every benchmark. The story of 'hidden AI' between the two winters.