
A Guide to How Generative AI Works (A Look Inside the Mind of GPT-3)

This article aims to explain GPT-3 (“Generative Pretrained Transformer 3”) technology to everyone, without overhyping the topic.


Table of Contents

Chapter 1: Introduction
Chapter 2: What Does 'GPT-3' Stand For?
Chapter 3: Breaking Down How it Works
Chapter 4: Limitations of GPT-3
Chapter 5: Conclusion
Chapter 6: Sources

Chapter 1

Introduction

Editor’s Note: OpenAI released GPT-4 on March 14, 2023, but has kept the details of its language model under wraps. While real-world performance continues to improve and GPT-4’s abilities surpass GPT-3’s, the fundamental nature of the technology remains the same.

“I noticed you’re in charge of marketing operations. That’s great! I think you’ll love this email because your job sucks.”
— GPT-3

The sentence you see above wasn’t written by me. It was written entirely by OpenAI’s GPT-3 artificial intelligence language model, in one of my early (and pretty bad) attempts to craft the introduction to a sales email.

Probably not the most shining example of AI, I admit.

There’s lots of hype around GPT-3 today. I’m not here to add to the hype.

From where I stand, writing about GPT-3 online falls into two categories:

  • Amazement and Doom! “Artificial intelligence is here, and it’s coming for your jobs!”
  • Technotalk. “Here d_model is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, d_ff = 4 × d_model).”

We need a better way to tell this story. This article aims to explain the GPT-3 technology to everyone, without overhyping the topic. I’ll try to make state-of-the-art AI research accessible, without diluting insight, by explaining concepts as simply as possible.

The goal is to help you:

  • Build intuition as a marketer for the current state of the technology
  • Understand how the technology works on a conceptual level, so you can evaluate vendors better
  • Avoid being fooled by the 90% of vendors who claim to use AI

Chapter 2

What Does 'GPT-3' Stand For?

GPT-3 stands for “Generative Pretrained Transformer 3.” That’s a mouthful, so let’s break it down:

Generative

“Generative” means that the language model is designed to “generate” words in an unsupervised manner. It was trained to predict the next word from a sequence of words. In other words, it uses existing text to create more text.

Pretrained

“Pretrained” means that the language model comes already trained, so you can use it “as is,” out of the box. In the past, language models had to be trained on datasets that were too large and computationally expensive for most people to work with.

Transformer

“Transformer” means that it uses the “transformer architecture” which we will cover below.

3 (Three)

“3” means that this is the third version of the model from OpenAI, the company that created GPT-3.

Chapter 3

Breaking Down How it Works

GPT-3 is what we call a trained language model.

A trained language model generates text, based on input, which influences its output.

That’s all GPT-3 does.

Granted, it does it extremely well — but it’s not the AI singularity that tech blogs have been heralding lately.

Let’s take a closer look at what’s happening within a GPT-3 process. We’ll do this by digging deeper into each part of the name: the G, P, T, and 3.

We’re going to do it in a slightly different order, though:

  • Starting with Pretrained
  • Then tackling Generative
  • Then addressing Transformer
  • And finishing with the 3

At the end of this quick explanation, you’ll have a better understanding of how GPT-3 works.

Pretrained

GPT-3 is trained with 45 terabytes of text data found online, hailing from Wikipedia, books, the open web, and more. “Training” is the process of:

  • Exposing the model to lots of text
  • Getting the model to make predictions based on that text, and
  • Correcting the predictions so that it learns to make better predictions over time
This illustration represents how GPT-3 and other large language models are trained: after the LLM makes a prediction, that prediction is tested, graded, and fed back into the LLM so it can refine its future predictions.
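
To make that expose-predict-correct loop concrete, here is a toy sketch in Python. It is nothing like GPT-3’s actual implementation (the tiny corpus and the bigram counting are invented purely for illustration); it only shows the shape of the loop: predict, grade, correct.

```python
# Toy sketch of the "expose, predict, correct" training loop.
# This is NOT how GPT-3 is implemented; a tiny bigram counter stands in for the model.
from collections import defaultdict, Counter

corpus = "the whale swam to the surface because the whale was lazy".split()

counts = defaultdict(Counter)  # counts[word] tallies which words have followed it

for current_word, actual_next in zip(corpus, corpus[1:]):
    # 1. Expose the model to text and let it predict the next word
    seen = counts[current_word].most_common(1)
    predicted = seen[0][0] if seen else None

    # 2. Grade the prediction against the word that actually came next
    was_correct = (predicted == actual_next)

    # 3. Correct the model so its future predictions improve
    counts[current_word][actual_next] += 1

print(counts["the"].most_common(1))  # [('whale', 2)]
```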

The above example is run at scale, across millions of examples, millions of times. And the GPT-3 you see today is just one model, trained on the above process, which was estimated to cost OpenAI $12M and 355 GPU-years of computing time just to train.

Before GPT-3, if you wanted to build a specific system for a specific natural language processing task, you needed custom architecture, lots of specific training data, lots of hand-tuning, and lots of PhDs.

But GPT-3 provided a general purpose model that’s pretty good at many things out of the box (such as sci-fi writing, marketing copy, arithmetic, SQL generation, and more).

The interesting thing about GPT-3 is that its creators didn’t have an end use-case in mind. 

Which is a lot like how people are.

You didn’t come out of the womb really good at poker, or at running demand gen campaigns, or administering Marketo without breaking a sweat. But your brain is a general-purpose computer that can figure out how to get very good at that stuff with enough practice and intentionality.

 

So, is GPT-3 the AI singularity we have all been worried about? No. GPT-3 feels tantalizingly close to artificial general intelligence (AGI), but it’s not.

Generative

At the heart of it, GPT-3 is trained to predict (or “generate”) the next word from a sequence of words.

Illustration shows the sequence "input, prediction, output."

In the above simplified examples I’ve shared, you may have assumed that GPT-3 outputs entire phrases. For instance:

“All the world’s a stage,” -> “and all the men and women merely players.”

But that’s not entirely accurate.

GPT-3 actually generates one word at a time (to be precise, these are called “tokens,” but let’s just assume a token is a word for now).

How does GPT-3 make these predictions? By basically using a lot of matrix multiplication and linear algebra.

Wait. You can do linear algebra on words?

Yep. It’s pretty clever, actually. At a high level, GPT-3 converts a word to what is called a vector (a list of numbers representing the word).

Here’s an example:

This is an example of a vector for the word “King”: a series of numbers that indicate how closely the word is associated with other words. When a computer can crunch enough of these vector positions and compare them against the other words in a sentence, it can begin to derive mathematical predictions about which words are likely to appear next.

The above is how the word “King” is represented as a vector.

Ultimately, it’s just a list of 50 numbers. By comparing these numbers to other numbers from other words (word vectors), you can capture a lot of information, meaning and associations around these words.

An example of a 4-dimensional word vector.

Source: GloVe: Global Vectors for Word Representation (https://nlp.stanford.edu/pubs/glove.pdf).
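
To see why representing words as lists of numbers is useful, here is a minimal sketch of comparing word vectors with cosine similarity. The vectors and dimensions below are invented for illustration; they are not real GloVe values.

```python
# Minimal sketch of comparing word vectors; the numbers are invented, not real GloVe values.
import numpy as np

vectors = {
    "king":  np.array([0.8, 0.6, 0.1, 0.9]),
    "queen": np.array([0.7, 0.7, 0.1, 0.9]),
    "whale": np.array([0.1, 0.2, 0.9, 0.3]),
}

def similarity(a, b):
    """Cosine similarity: closer to 1.0 means the words are used in similar ways."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(vectors["king"], vectors["queen"]))  # high: closely related words
print(similarity(vectors["king"], vectors["whale"]))  # lower: unrelated words
```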

This GIF shows the earlier GPT-2 model trying to complete some Shakespearean dialog. The results are a bit of a disaster.

This interactive tool shows how GPT-2 attempts to use predictions to build content. You can play with this tool here to appreciate how the technology has evolved. Compared to the latest tools, it feels pretty primitive.

In our Shakespeare example, we’re only interested in the first word it outputs (“women”), and as you can see, GPT-2 gets it right most of the time.

Does this look familiar?

This is similar to the “next-word prediction” feature on your smartphone keyboards.

The way your smartphone predicts the next word you may want to type is a very simple example of the "next word prediction" used by generative AI.

Think of GPT-3 generation capabilities as just a souped-up version of your smartphone keyboard’s word predictor.

To summarize, GPT-3 produces outputs by:

  1. Converting the input words into vectors (the lists of numbers mentioned above)
  2. Putting them into a magic box (which we will talk more about later), doing some matrix multiplication, and calculating a prediction
  3. Converting the resulting vector back into a word
Illustration shows how GPT-3 converts words into vectors (a collection of numbers), then runs algorithmic predictions to predict which new vectors make sense, then converts those vectors into words.
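
Here is a hedged sketch of those three steps in code. The tiny vocabulary, the random numbers, and the magic_box function are stand-ins invented for illustration; they are not GPT-3’s real parameters or architecture.

```python
# Sketch of the word -> vector -> magic box -> vector -> word pipeline.
# Random numbers stand in for GPT-3's trained parameters, so the output is arbitrary.
import numpy as np

vocab = ["the", "whale", "was", "lazy"]
embeddings = np.random.rand(len(vocab), 8)   # step 1: each word maps to a vector
weights = np.random.rand(8, len(vocab))      # stand-in for trained parameters

def magic_box(vector):
    # step 2: matrix multiplication turns the vector into a score for every vocab word
    return vector @ weights

def predict_next(word):
    vector = embeddings[vocab.index(word)]    # word -> vector
    scores = magic_box(vector)                # vector -> scores
    return vocab[int(np.argmax(scores))]      # step 3: best-scoring position -> word

print(predict_next("whale"))
```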

Transformer

In June 2017, a bunch of really smart folks published a paper titled, “Attention Is All You Need.” In it, they introduced an architecture called a Transformer, which was quite “transformative” (pardon me) to the field of natural language processing.

Now I’m going to explain “self-attention” — a crucial component of the 2017 paper. Don’t worry: I’m oversimplifying things so everyone can understand it without specialized knowledge.

What is Self-Attention?

Imagine we have this input sentence that we want to process:

“The whale didn’t swim to the surface because it was too lazy.”

What does “it” in this sentence refer to? Is it referring to the surface or the whale? (Spoiler: It’s the whale.) This is a simple question for a toddler to answer … but it’s not that easy for an algorithm.

Self-attention enables our model to correctly associate the word “it” with “whale.” It enables a text model to “look at” every single word in the original sentence before making a decision about what word to include in its output.

This allows the AI to “bake” the understanding of other relevant words into the word it’s currently processing.
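
For the curious, here is a highly simplified sketch of the scaled dot-product idea behind self-attention. It leaves out the learned query, key, and value projections a real transformer uses, and the word vectors are random placeholders, so treat it as a picture of the mechanics rather than GPT-3’s actual code.

```python
# Simplified self-attention: every word scores its relevance to every other word,
# then blends in the words it attends to. Real transformers add learned Q/K/V projections.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)   # relevance of each word to each other word
    weights = softmax(scores)       # normalize into attention weights
    return weights @ X, weights     # each output mixes in the words it attended to

X = np.random.rand(12, 16)          # 12 words of our whale sentence, 16 numbers each
output, weights = self_attention(X)
print(weights.shape)                # (12, 12): one attention weight per pair of words
```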

Now let’s see how GPT-3 maintains the coherence of the new text it generates. After each token is produced, it’s added to the sequence of inputs:

Illustration shows how GPT-3 builds sequences of words.

This new sequence becomes the new input to the model in its next step. Rinse and repeat.

This is called “auto-regression,” and is one of the ideas which made Recurrent Neural Networks (RNNs) “unreasonably effective.”
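
Here is a toy sketch of that rinse-and-repeat loop. The model function below is a canned stand-in for GPT-3, not a real API call; the only point is that each predicted token gets appended to the input before the next prediction is made.

```python
# Auto-regression sketch: generate one token, append it, feed the longer sequence back in.
def model(tokens):
    # Canned stand-in for GPT-3: a real model would predict the most likely next token here.
    canned = {"all the world's a stage ,": "and"}
    return canned.get(" ".join(tokens), "<end>")

tokens = "all the world's a stage ,".split()
while True:
    next_token = model(tokens)   # predict one token from everything generated so far...
    if next_token == "<end>":
        break
    tokens.append(next_token)    # ...append it, and repeat with the longer sequence

print(" ".join(tokens))          # all the world's a stage , and
```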

This paper, “The Unreasonable Effectiveness of Recurrent Neural Networks,” captured my attention in May 2015, and was one of the founding inspirations for Saleswhale, which was acquired by 6sense in 2021 and evolved into Conversational Email.

This screenshot shows 6sense's Conversational Email tool, which is powered in part by the same large language model developed by OpenAI.

As an example, look at the prompts given to our Conversational Email tool for a company called LaunchDarkly:

“LaunchDarkly is a feature management platform. It enables dev and ops teams to control their whole feature lifecycle, from concept to launch to value. Feature flagging is an industry best practice of wrapping a new or risky section of code or infrastructure change with a flag.”

In this instance, words like “it” and “their” are referring to other words — specifically “LaunchDarkly” and “dev and ops teams.” There’s no way for the AI to understand or process this prompt without incorporating that context. When the model ingests this sentence, it has to know that:

  • The sentence refers to LaunchDarkly
  • “Their” refers to dev and ops teams
  • “With a flag” refers to the concept of feature flagging

This is what self-attention does. It bakes in the model’s understanding of the relevant terms that explain the context of certain words before processing them.

Why are transformers such a large breakthrough? It’s the ability to scale up this process immensely through a concept called parallelization.

Before 2017, the way we used deep learning to understand text was by using models like Recurrent Neural Networks (RNN), which we mentioned briefly above.

This graph shows a visual representation of how recurrent neural networks work by considering an entire sequence of words before predicting the word that should follow.

Source: Wikipedia

An RNN basically takes an input (such as “The whale didn’t swim to the surface because it was too lazy”), processes the words one at a time, and then sequentially spits out the output.

But RNNs had issues. They struggled with large inputs of text, and by the time they’d get to the end of a sentence, they’d forget what had happened at its beginning. An RNN might forget that it was “the whale” who was too lazy and think it was “the surface” instead.

And the biggest problem with RNNs?  Because they processed words sequentially, it was hard to distribute the computing tasks to multiple processors. And as LLMs got larger and larger, the sheer amount of computation pushed through individual GPUs caused slowdowns in even the fastest machines. This slowdown also meant you couldn’t train them on much data.

Transformers changed everything. They could be parallelized very efficiently, and with enough hardware, you could train some really big models.
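
Here is a rough sketch of that difference, using made-up shapes rather than real model code. The RNN-style loop has to run one step at a time because each step depends on the previous state, while the transformer-style version handles every position in one batched matrix multiply that a GPU can spread across many cores.

```python
# Why parallelization matters (shapes and weights are invented for illustration).
import numpy as np

words = np.random.rand(512, 64)   # 512 word vectors from a long passage
W = np.random.rand(64, 64)        # stand-in for learned weights

# RNN-style: each step needs the previous step's state, so the steps run one by one
state = np.zeros(64)
for w in words:
    state = np.tanh(w @ W + state)

# Transformer-style: all 512 positions go through one batched matrix multiply at once
outputs = np.tanh(words @ W)
print(outputs.shape)              # (512, 64)
```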

How big? Really, really big.

GPT-3 was trained on 45 terabytes (!) of text, which includes a huge swath of the public internet. That’s mind-boggling, and it brings us to the final part of this article.

3 (Three)

As mentioned above, the invention of Transformers gave us the ability to throw massive data and inputs at our models.

And GPT-3 is MASSIVE.

It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which word token to generate at each run.

Think of parameters as dials on a radio that you move up and down to find the right signal.
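
To get a feel for where numbers like 175 billion come from, here is a back-of-the-envelope sketch. The sizes below are invented and nowhere near GPT-3’s real configuration; the point is simply that every entry of every weight matrix is one of those dials.

```python
# Toy parameter count; the sizes are made up and far smaller than GPT-3's.
vocab_size = 1_000
embedding_size = 128
num_layers = 4

embedding_params = vocab_size * embedding_size              # one vector per word in the vocab
feedforward_params = embedding_size * (4 * embedding_size)  # a single toy feed-forward matrix
total = embedding_params + num_layers * feedforward_params

print(f"{total:,} dials set during training")  # 390,144 in this toy setup
```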

To understand GPT-3, it helps to understand how GPT-2 came about.

A Wild Idea

In October 2018, Google released BERT, which overshadowed the release of OpenAI’s “GPT-1” (just called GPT back then). Instead of trying to beat BERT at its own game, the OpenAI team decided to take a different route.

Here’s how I imagine the conversation among the OpenAI team went down:

Engineer #1: “Hey, this BERT thing seems to be working better than our GPT idea. What do we do?”

Engineer #2: “We can try copying them…”

Engineer #3: “I know! Let’s just throw more GPUs and data at our model and call it GPT-2!”

All three together: “Awesome!”

We’ll never know if that conversation happened, but it’s essentially what the OpenAI team set out to do. (This is another gross simplification.) But the real question is: Did it work? Does cranking all the dials to 11 and throwing more money at the problem lead to better results?

The answer was, surprisingly, yes.

They blew BERT, with an already impressive 340 million parameters, out of the water with GPT-2’s stupendous 1.5 billion parameters.

GPT-2 showed that it’s possible to improve the performance of a model by “simply” increasing the model size.

With the success of GPT-2… guess what happened next? Someone went even further, saying, “Hey… what if we scaled this even more? Like way more?”

And that’s how GPT-3 was born.

Two Orders of Magnitude Larger

GPT-3 is a MASSIVE, MASSIVE, 175 billion parameter model. That’s two orders of magnitude larger than the already huge GPT-2 model.

Think of GPT-3 as a much bigger GPT-2, an approach made possible by the transformer architecture that we covered above.

This bar graph shows that an early large language model called BERT has 340 million parameters. GPT-2 has 1.5 billion. GPT-3 has 175 billion parameters. No reliable word yet on the size of GPT-4.
This line graph from OpenAI's paper about GPT-3 shows that the larger language model is much better at determining context.

GPT-3 cost 200x more to train than GPT-2: a whopping $12 million in computation just to train the model.

GPT-3 showed the potential to “scale it till you break it” with the transformer architecture.

Chapter 4

Limitations of GPT-3

For all of its might and ability to appear human-like in its output, GPT-3 loses coherence, especially on longer texts. And if you read carefully, there can be a surrealist quality to the text that it generates.

Let’s compare GPT-3 to a toddler.

A toddler may have grammatically incorrect or broken speech, but it mostly has a coherent line of thought. It has some concept it wants to express, but it may not know the right words or grammar to express it.

GPT-3 is the exact opposite. It knows the words and the right grammar, but lacks an underlying coherent concept of what it wants to express.

Here’s my favorite GPT-3 generated passage from a research paper. Even though it’s obviously incoherent, and easily identified as machine written, for some absurdist reason, I really like it:

Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm

Subtitle: Joaquin Phoenix pledged to not change for each awards event

Article: A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedo with a paper bag over his head that read, “I am a shape-shifter. I can’t change the world. I can only change myself.” It was a promise to not change to fit into the Hollywood mold: “I think that’s a really special thing, to not change yourself. I think it’s a really special thing to say, ‘This is what’s inside of me, I’m proud of it, and I’m not going to be ashamed because of the way that someone else thinks I should be.’” Now, it’s the Oscars, and Phoenix is at it again. But this time, his publicist is saying he’ll be wearing a tux no matter what.

Megyn Kelly was not impressed, and she let him have it on The Tonight Show. “You know, I feel like, I feel like you could have worn the tux,” she says. “But you’re saying you’re a shape-shifter. I don’t know if you can change your tux, but you can change your mind. You can change your mind. You can change your mind.” Phoenix says he did, but it didn’t stick. “I was like, ‘Okay, I’m going to wear a tuxedo to this thing.’ And then I thought, ‘I don’t want to wear a tuxedo to this thing.’” Kelly goes on to encourage him to change his mind again, but Phoenix says it’s too late: “I’m committed to wearing this.”

This screenshot shows the same text referenced above, but within a computer terminal format.

Chapter 5

Conclusion

GPT-3 has captured the imagination of the AI community. An interesting discussion on a programming forum compared the human brain with where GPT-3 is now.

A typical human brain has over 100 trillion synapses, which is nearly another three orders of magnitude larger than the current GPT-3’s 175 billion parameters.

It took OpenAI just slightly more than a year to increase their GPT model by 2 orders of magnitude (1.5 billion -> 175 billion).

What would happen if we had a trillion parameter model?

I guess we will find out in another year or two.

Chapter 6

Sources


Gabriel Lim