Building your own ChatGPT from scratch

In a world where technology constantly pushes the boundaries of human imagination, one phenomenon stands out: ChatGPT. You’ve probably experienced its magic, admired how it can chat meaningfully, and maybe even wondered how it all works inside. ChatGPT is more than just a program; it’s a gateway to the realms of artificial intelligence, showcasing the amazing progress we’ve made in machine learning.

At its core, ChatGPT is built on a technology called Generative Pre-trained Transformer (GPT). But what does that really mean? Let’s find out.

In this blog, we’ll explore the fundamentals of machine learning, including how machines generate words. We’ll delve into the transformer architecture and its attention mechanisms. Then, we’ll demystify GPT and its role in AI. Finally, we’ll embark on coding our own GPT from scratch, bridging theory and practice in artificial intelligence.

How does a machine learn?

Imagine a network of interconnected knobs—this is a neural network, inspired by our own brains. In this network, information flows through nodes, just like thoughts in our minds. Each node processes information and passes it along to the next, making decisions as it goes.

Each knob represents a neuron, a fundamental unit of processing. As information flows through this network, these neurons spring to action, analyzing, interpreting, and transmitting data. It’s similar to how thoughts travel through your mind—constantly interacting and influencing one another to form a coherent understanding of the world around you. In a neural network, these interactions pave the way for learning, adaptation, and intelligent decision-making, mirroring the complex dynamics of the human mind in the digital realm.

During the training phase of a neural network, we essentially guide it to understand patterns in data. We start by showing the network examples: we say, “Here’s the input, and here’s what we expect the output to be.” Then comes the fascinating part: we tweak these knobs so that the network gets better at predicting the correct output for a given input.

As we tweak these knobs, our goal is simple: we want the network to get closer and closer to producing outputs that match our expectations. It’s like fine-tuning an instrument to play the perfect melody. Gradually, through this process, the network starts giving outputs that align more closely with what we anticipate. This adjustment process relies on backpropagation, which works out how much each knob contributed to the error so that every knob can be nudged in the direction that brings the network’s predictions closer to the provided input-output pairs. To understand backpropagation better, you can refer to the following blog.

Coding Backpropagation and Gradient Descent From Scratch without using any libraries
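To make the knob-tweaking idea concrete, here is a minimal sketch with a single knob w. The data, the learning rate, and the squared-error loss are assumptions chosen for illustration; a real network has many knobs and uses backpropagation to get the gradient for each of them.

```python
# Toy "network" with a single knob w: prediction = w * x.
# The training pairs follow y = 2 * x, so the ideal knob position is 2.0.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # initial knob position (arbitrary starting guess)
learning_rate = 0.05  # how far the knob is nudged per update (hypothetical value)

for epoch in range(200):
    for x, target in examples:
        prediction = w * x
        error = prediction - target
        gradient = 2 * error * x       # derivative of (w * x - target) ** 2 w.r.t. w
        w -= learning_rate * gradient  # nudge the knob to reduce the error

print(round(w, 3))
```

After training, w sits very close to 2.0, which is the knob position that reproduces every input-output pair.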

Once our neural network has completed its training phase and learned the knob positions from the examples we provided, it enters the inference phase, where it gets to showcase its newfound skills.

During inference we freeze the adjustments we made to the knobs during training. Think of it as setting the dials to the perfect settings—and now the network is ready to tackle real-world tasks. When we present the network with new data, it springs into action, processing the input and swiftly generating an output based on what it’s learned.

Neural networks are versatile, capable of handling various tasks, from image recognition to natural language processing. By harnessing interconnected neurons, they unlock the potential of artificial intelligence, driving innovation across industries.

For a detailed understanding of how neural networks work, you can refer to the following CloudxLab playlist.

How does a model output a word?

Now that we understand the concept of a neural network, let’s delve into its ability to perform classification tasks and how it outputs words.

Classification is like sorting things into different groups. Imagine we’re sorting pictures into two categories: pictures of cats and pictures of everything else (not cats). Our job is to teach the computer to look at each picture and decide which category it belongs to. So, when we show it a picture, it’ll say, “Yes, that’s a cat,” or “No, that’s not a cat.” That’s how classification works—it helps computers organize information into clear groups based on what they see.

Outputting a word is also a classification task. Let’s think of a big dictionary with lots of words—say, 50,000 of them. Now, imagine we have a smart computer that’s learning to predict the next word in a sentence. So, when we give it a sequence of words from a sentence, it guesses what word should come next.

But here’s the thing: computers think in numbers, not words. So, we turn each word into a special number, kind of like a token. Then, we train our computer to guess which number (or word) should come next in a sentence. When we give it some words, it looks at all the possibilities in the dictionary and assigns a chance (or probability) to each word, saying which one it thinks is most likely to come next.

Suppose we have the following sequences and their corresponding next words:

  1. Sequence: “The cat”, Next word: “sat”
  2. Sequence: “The cat sat”, Next word: “on”
  3. Sequence: “The cat sat on”, Next word: “the”
  4. Sequence: “The cat sat on the”, Next word: “mat”

During training, the neural network will learn from these patterns. It will understand that “The cat” is typically followed by “sat”, “The cat sat” is followed by “on”, “The cat sat on” is followed by “the”, and “The cat sat on the” is followed by “mat”. This way, the model learns the language structure and can predict the next word in a sequence based on the learned patterns. After training, our model will be good at predicting the next word in a sentence.
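The sequence and next-word pairs listed above can be produced mechanically from a sentence. The helper name make_training_pairs below is hypothetical, and splitting on spaces is a simplification of real tokenization.

```python
def make_training_pairs(sentence):
    """Split a sentence into (context, next_word) training examples."""
    words = sentence.split()
    pairs = []
    # Start at index 2 so the first context has two words, as in the list above.
    for i in range(2, len(words)):
        context = " ".join(words[:i])
        pairs.append((context, words[i]))
    return pairs

pairs = make_training_pairs("The cat sat on the mat")
for context, next_word in pairs:
    print(f"{context!r} is followed by {next_word!r}")
```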

So, our computer’s job is to learn from lots of examples and get really good at guessing the next word in a sentence based on what it’s seen before. It’s like a super smart helper, trying to predict what word you’ll say next in a conversation.

In the above example, we have a dictionary (lookup) of n words. This means the neural network recognizes only these n words from the dictionary and can only produce predictions based on them. Any word not in the dictionary won’t be recognized or generated by the model.

Now, we provide the input “the cat and the dog ___”. We can see that each word is represented by a token in the lookup such as ‘the’ as 1, ‘cat’ as 5, ‘and’ as 2, etc. So we convert our input sequence to tokens using the lookup. Then we pass these tokens to the neural network, and it predicts the probability for each token, representing the chance of that token coming as the next word in the sequence. Then we choose the token with the highest probability, which in our case is token number 4. Upon performing a lookup, we find that token 4 represents the word “play”. So this becomes our output, and the sentence becomes “the cat and the dog play”.
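Here is a small sketch of that lookup, predict, and lookup-again flow. The token ids for “the”, “cat”, “and” and “play” follow the example; the id for “dog”, the function names, and the probabilities are made up for illustration, since a real model computes the probabilities itself.

```python
# Word-to-token lookup. The id for "dog" is an assumption made for this sketch.
word_to_token = {"the": 1, "and": 2, "dog": 3, "play": 4, "cat": 5}
token_to_word = {token: word for word, token in word_to_token.items()}

def encode(text):
    """Convert a sentence into a list of token ids."""
    return [word_to_token[word] for word in text.split()]

def fake_model(tokens):
    """Stand-in for a trained network: returns one probability per token id."""
    return {1: 0.05, 2: 0.10, 3: 0.05, 4: 0.70, 5: 0.10}  # hypothetical values

tokens = encode("the cat and the dog")
probabilities = fake_model(tokens)
next_token = max(probabilities, key=probabilities.get)  # highest probability wins
print(tokens, next_token, token_to_word[next_token])
```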

In our example, with a limited vocabulary of ‘n’ words, the neural network can only predict the next word from the provided set of words. However, in large language models like ChatGPT, Bard, etc., the model is trained on a vast corpus of text data containing a diverse range of words and phrases from various sources. By training on a large dataset encompassing a wide vocabulary, the model becomes more proficient at understanding and generating human-like text. It learns the statistical relationships between words, their contexts, and the nuances of language usage across different domains.

When you give LLMs a query or a prompt, they predict the next word in the sequence. Once they generate a word, they then consider what word might come after that, and the process continues until the response is completed. This iterative prediction process allows these models to generate coherent and contextually relevant responses.

Let’s imagine the input prompt provided to ChatGPT is “Write a poem on nature.” Initially, the LLM might predict “The” as the first word. Then, considering “The” as the beginning of the poem, it might predict “beauty” as the next word, leading to “The beauty ____.” Continuing this process, it might predict “of” as the next word, resulting in “The beauty of ____.”

As the LLM predicts each subsequent word, the poem gradually takes shape. It might predict “nature” as the next word, leading to “The beauty of nature ____.” Then, it might predict “is” as the following word, resulting in “The beauty of nature is ____.”

The process continues until the LLM generates a coherent and evocative poem on nature. This iterative approach enables LLMs to create engaging and contextually relevant text based on the given prompt.
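The word-by-word loop can be sketched as follows. The NEXT_WORD table is a hypothetical stand-in for the model: it only looks at the last word, whereas a real LLM looks at the whole sequence and outputs probabilities.

```python
# Hypothetical "model": maps the last generated word to the next one.
NEXT_WORD = {
    "<start>": "The",
    "The": "beauty",
    "beauty": "of",
    "of": "nature",
    "nature": "is",
    "is": "<end>",
}

def generate(max_words=10):
    """Generate words one at a time until <end> or the length limit."""
    words = ["<start>"]
    for _ in range(max_words):
        next_word = NEXT_WORD[words[-1]]  # predict from what we have so far
        if next_word == "<end>":
            break
        words.append(next_word)           # feed the prediction back in
    return " ".join(words[1:])

poem_start = generate()
print(poem_start)
```

Each predicted word is appended to the sequence and becomes part of the input for the next prediction, which is the key idea behind the iterative process.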

Recurrent Neural Networks

Imagine you’re reading a story, and you want to understand what’s happening as you go along. Your brain naturally remembers what you read before and uses that information to understand the story better. That’s kind of how recurrent neural networks work!

In simple terms, RNNs are like brains for computers. They’re really good at processing sequences of data, like words in a sentence or frames in a video. RNNs were introduced in the 1980s. What makes them special is that they remember what they’ve seen before and use that memory to make sense of what’s happening next.

So, if you feed a sentence into an RNN, it’ll read one word at a time, just like you read one word after another in a story. But here’s the cool part: as it reads each word, it keeps a memory of what it read before. This memory helps it understand the context of the sentence and make better predictions about what word might come next.
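A minimal sketch of this “memory” is a single number that gets updated at every step. The scalar weights w_hidden and w_input are assumptions for illustration; a real RNN uses learned weight matrices and vector inputs.

```python
import math

def rnn_step(hidden, x, w_hidden=0.5, w_input=1.0):
    """One recurrent update: mix the old memory with the new input."""
    return math.tanh(w_hidden * hidden + w_input * x)

def run_rnn(inputs):
    hidden = 0.0  # empty memory before reading anything
    for x in inputs:
        hidden = rnn_step(hidden, x)
    return hidden

# Same numbers in a different order give a different final memory.
early_signal = run_rnn([1.0, 0.0, 0.0])
late_signal = run_rnn([0.0, 0.0, 1.0])
print(early_signal, late_signal)
```

The signal that arrived early leaves a weaker trace in the final memory than the one that arrived late, which hints at why plain RNNs have trouble with long sequences.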

While RNNs were great at processing sequences of data, they struggled with remembering long sequences. So, to address this issue, researchers came up with a special type of RNN called LSTM, which stands for Long Short-Term Memory. LSTMs are like upgraded versions of RNNs—they’re smarter and better at remembering important information from the past.

LSTMs were better than plain RNNs at retaining memory over long sequences, but they still struggled with very long ones. To address these challenges, researchers introduced the Transformer model.

For understanding RNNs and LSTMs in detail, you can refer to the following CloudxLab playlist.


The introduction of the Transformer marked a significant breakthrough in the field of Natural Language Processing. It emerged in the seminal paper titled “Attention is All You Need.”

The Transformer’s innovative design, leveraging self-attention mechanisms, addressed these shortcomings. By allowing the model to focus on relevant parts of the input sequence, the Transformer could capture long-range dependencies and contextual information more effectively. This breakthrough paved the way for more sophisticated language models, including ChatGPT, that excel in understanding and generating coherent text.

Self-Attention Mechanism

The basic idea: Each time the model predicts an output word, it only uses a part of the input where the most relevant information is concentrated instead of the entire sentence.

Suppose we have a sentence of n words:-

Since machines only understand numbers, let’s map these words to vectors C₁, C₂, …, Cₙ.

Now if we take a word vector Cᵢ and want to compute its similarity with every other vector, we take the dot product of Cᵢ with each vector from C₁ to Cₙ. If the dot product is high, the two vectors are very similar.

To understand about word vectors, embeddings and how dot products represent similarity between two vectors, you can refer to

These dot products can be big or small, but they’re not really easy to understand on their own. So, we want to make them simpler and easier to compare. To do that, we use a trick called normalization. It’s like putting all these numbers on the same scale, from 0 to 1. This way, we can see which words are more similar to each other. The higher the number, the more similar the words.

Suppose dot(Cᵢ, C₁) is 0.7 and dot(Cᵢ, C₆) is 0.5. Then we can easily say that Cᵢ is more similar to C₁ than to C₆.

Now, imagine we have these nice numbers, but they’re still not exactly like probabilities (the chances of something happening). So, we use another trick called softmax. It helps us turn these numbers into something that looks more like probabilities.

Softmax basically adjusts the numbers so they all add up to 1, like percentages. This helps the computer understand how important each word is compared to the others. It’s like saying, “Out of all these words, which ones should we pay the most attention to?” Let’s call them attention scores.

Now, we want to use these attention scores to calculate a weighted sum of the original vectors C₁ to Cₙ. This weighted sum is called the context vector, and it gives us a representation of the input sentence that takes into account the importance of each word based on the attention scores. It provides a summary of the sentence that focuses more on the words that are deemed most relevant for the task at hand.

Let’s understand with an example

Suppose we have our input:- “I love Natural Language Processing”.

Step 1:- Let’s represent each word by a 3-dimensional vector. For instance:

  • “I”: [0.1, 0.2, 0.3]
  • “love”: [0.2, 0.3, 0.4]
  • “natural”: [0.3, 0.4, 0.5]
  • “language”: [0.4, 0.5, 0.6]
  • “processing”: [0.5, 0.6, 0.7]

Step 2:- We compute the dot product of each word with every other word. So dot product of “I” with each word will be:-

  1. “I” dot “I” = 0.1 * 0.1 + 0.2 * 0.2 + 0.3 * 0.3 = 0.01 + 0.04 + 0.09 = 0.14
  2. “I” dot “love” = 0.1 * 0.2 + 0.2 * 0.3 + 0.3 * 0.4 = 0.02 + 0.06 + 0.12 = 0.2
  3. “I” dot “natural” = 0.1 * 0.3 + 0.2 * 0.4 + 0.3 * 0.5 = 0.03 + 0.08 + 0.15 = 0.26
  4. “I” dot “language” = 0.1 * 0.4 + 0.2 * 0.5 + 0.3 * 0.6 = 0.04 + 0.1 + 0.18 = 0.32
  5. “I” dot “processing” = 0.1 * 0.5 + 0.2 * 0.6 + 0.3 * 0.7 = 0.05 + 0.12 + 0.21 = 0.38

So the dot product vector of “I” becomes [0.14, 0.2, 0.26, 0.32, 0.38]. In the same way, we will calculate the dot products for the other words too.
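The dot products of Step 2 can be checked with a few lines of Python. The vectors are the ones from Step 1; the helper name dot is just for this sketch.

```python
vectors = {
    "I": [0.1, 0.2, 0.3],
    "love": [0.2, 0.3, 0.4],
    "natural": [0.3, 0.4, 0.5],
    "language": [0.4, 0.5, 0.6],
    "processing": [0.5, 0.6, 0.7],
}

def dot(a, b):
    """Dot product: multiply matching components and add them up."""
    return sum(x * y for x, y in zip(a, b))

# Compare "I" with every word in the sentence (including itself).
scores_for_i = [round(dot(vectors["I"], v), 2) for v in vectors.values()]
print(scores_for_i)  # [0.14, 0.2, 0.26, 0.32, 0.38]
```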

Step 3:- Then we will normalize the dot product vector. Let’s skip it for this case. After normalising, we will apply the softmax function for dot product vector of each word.

Given a vector z = [z1, z2, z3, …, zn], the softmax function is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

for i = 1, 2, …, n.
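The softmax definition can be written in a few lines of Python. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition itself; it does not change the result.

```python
import math

def softmax(z):
    """Turn a list of scores into positive weights that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

# Dot product vector of "I" from Step 2
weights = softmax([0.14, 0.2, 0.26, 0.32, 0.38])
print([round(w, 5) for w in weights], sum(weights))
```

Larger scores get larger weights, and the weights always add up to 1.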

So, on applying the formula for the word “I”, we get the softmax result (rounded to five decimal places): [0.17675, 0.18768, 0.19928, 0.21160, 0.22469]. Here, the first value, 0.17675, represents the attention score of the word “I” with itself. The second value, 0.18768, represents the attention score of the word “I” with the word “love”. In the same way, each score represents the attention score of the word “I” with another word in the sequence. We can see that these scores sum to 1.

Step 4:- At last, we compute the weighted sum for each word by multiplying each word vector by its attention score. So for “I”, it will look like:

For “I” and “I”, we will do 0.17675 * [0.1, 0.2, 0.3] = [0.017675, 0.035350, 0.053025].

For “I” and “love”, it will be 0.18768 * [0.2, 0.3, 0.4] = [0.037536, 0.056304, 0.075072].

For “I” and “natural”, it will be 0.19928 * [0.3, 0.4, 0.5] = [0.059784, 0.079712, 0.099640].

For “I” and “language”, it will be 0.21160 * [0.4, 0.5, 0.6] = [0.084640, 0.105800, 0.126960].

For “I” and “processing”, it will be 0.22469 * [0.5, 0.6, 0.7] = [0.112345, 0.134814, 0.157283].

After this, we will sum all these vectors, which will give us the context vector of “I”. So it will be [0.017675, 0.035350, 0.053025] + [0.037536, 0.056304, 0.075072] + [0.059784, 0.079712, 0.099640] + [0.084640, 0.105800, 0.126960] + [0.112345, 0.134814, 0.157283], which will be [0.311980, 0.411980, 0.511980].

So, the context vector for “I” comes out as [0.311980, 0.411980, 0.511980]. In the same way, we will calculate the context vectors of all the other words.
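Steps 2 to 4 can be put together in one short sketch. It follows the simplified recipe of this section (no learned projections and no extra normalization step); the function names are chosen for illustration.

```python
import math

words = ["I", "love", "natural", "language", "processing"]
vectors = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.3, 0.4, 0.5],
    [0.4, 0.5, 0.6],
    [0.5, 0.6, 0.7],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    exps = [math.exp(value - max(z)) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(query_index):
    """Weighted sum of all word vectors, weighted by attention scores."""
    query = vectors[query_index]
    scores = [dot(query, key) for key in vectors]   # Step 2: dot products
    weights = softmax(scores)                       # Step 3: softmax
    dims = len(query)
    # Step 4: weighted sum, computed one dimension at a time
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dims)]

context_i = context_vector(words.index("I"))
print([round(c, 5) for c in context_i])
```

Calling context_vector with another index gives the context vector for that word.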

Remember, since we are calculating the attention scores with respect to “I” here, “I” is referred to as the query. The words it is compared against (every word in the sentence, including “I” itself) are referred to as keys. When we calculate the scores for “love”, “love” will be the query. The word vectors that are combined in the weighted sum are called values. These terms are important as we will be using them while coding the transformer.

Now the question arises, why have we done all this?

On their own, word vectors don’t tell us how a particular word relates to the other words in the sentence, so a sentence represented this way is no better than a random collection of words. But a sentence is a group of words that makes sense together. So we calculate context vectors, which also keep the information about the relationship of Cᵢ with every other vector. This is called the self-attention mechanism.

Transformer architecture

The transformer architecture is made of two blocks: Encoder (left) and Decoder (right). These encoder and decoder blocks are stacked N times.

Encoder:
  • Functionality: The encoder’s goal is to extract meaningful features and patterns from the input sequence, which can then be used by the decoder for generating the output sequence. It analyzes the input sequence, token by token, and generates contextualized representations for each token. These representations capture information about the token’s context within the input sequence.
  • Input: The encoder receives the input sequence, typically represented as a sequence of word embeddings or tokens.
  • Output: The encoder outputs a sequence of contextualized representations for each token in the input sequence.
Decoder:
  • Functionality: The decoder block is tasked with generating the output sequence based on the contextualized representations provided by the encoder. Its task is to predict the next token in the output sequence based on the context provided by the encoder and the previously generated tokens. It generates the output sequence token by token, taking into account the learned representations and the context provided by the encoder.
  • Input: Initially, the decoder receives the same sequence of contextualized representations generated by the encoder.
  • Outputs (shifted right): During training, the decoder also receives a shifted version of the output sequence, where each token is shifted to the right by one position. This shifted sequence is used for teacher forcing, helping the decoder learn to predict the next token in the sequence based on the previous tokens.
  • Output: The decoder generates the output sequence, which represents the model’s predictions or translations.

Positional Encoding

Consider the following two sentences:

> I do not like the story of the movie, but I do like the cast

> I do like the story of the movie, but I do not like the cast

What is the difference between these two sentences?

The words are the same, but the meaning is different. This shows that information about word order is required to distinguish different meanings.

Positional embedding provides this information: it generates an embedding for each position, which allows the model to learn the relative positions of words.
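The sketch below shows the idea on the two sentences above (punctuation removed). The random tables stand in for embedding tables that a real model would learn, and the tiny size n_embd = 4 is an assumption chosen for readability.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n_embd = 4      # tiny embedding size, chosen only for readability

a = "I do not like the story of the movie but I do like the cast".split()
b = "I do like the story of the movie but I do not like the cast".split()

def random_vector():
    return [random.uniform(-1, 1) for _ in range(n_embd)]

# Random tables stand in for the learned ones.
token_embedding = {word: random_vector() for word in sorted(set(a))}
position_embedding = [random_vector() for _ in range(len(a))]

def embed(words):
    """Token vector plus position vector for every word in the sentence."""
    return [
        [t + p for t, p in zip(token_embedding[w], position_embedding[i])]
        for i, w in enumerate(words)
    ]

same_words = sorted(a) == sorted(b)    # True: only the order differs
not_in_a = embed(a)[a.index("not")]    # "not" at position 2
not_in_b = embed(b)[b.index("not")]    # "not" at position 11
print(same_words, not_in_a != not_in_b)
```

Without the position vectors, “not” would have exactly the same representation in both sentences; with them, the model can tell the two occurrences apart.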

Now that we have a brief overview of how the transformer works, let’s cover the components inside the encoder and decoder blocks one by one. We’ll simultaneously code the components, which will give us the final code of GPT.

Coding GPT from scratch

Let’s code it. Make sure you are comfortable with TensorFlow and Keras, as we will be using them. You can access the complete code used in this blog at

Head(attention) block

So, we’ll start with implementing the Head block. In the context of transformer-based architectures, a “Head” refers to a distinct computational unit responsible for performing attention computations. It operates within the broader framework of self-attention, allowing the model to focus on relevant parts of the input sequence.

Let’s start with writing the __init__() method that sets up the necessary components and parameters required for attention computations.

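import tensorflow as tf

# block_size (maximum context length) and dropout (dropout rate) are
# hyperparameters that are expected to be defined before this class is used.

class Head(tf.keras.layers.Layer):
    """ one head of self-attention """

    def __init__(self, head_size):
        super(Head, self).__init__()
        # linear projections (no bias) that produce keys, queries and values
        self.key = tf.keras.layers.Dense(head_size, use_bias=False)
        self.query = tf.keras.layers.Dense(head_size, use_bias=False)
        self.value = tf.keras.layers.Dense(head_size, use_bias=False)

        # lower triangular matrix of ones, used as the causal mask
        tril = tf.linalg.band_part(tf.ones((block_size, block_size)), -1, 0)
        self.tril = tf.constant(tril)

        self.dropout = tf.keras.layers.Dropout(dropout)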

In the above code,

  • The key, query, and value layers are initialized as dense layers using the tf.keras.layers.Dense module. They are created without biases (use_bias=False), so each one is a plain linear projection of the input into a head_size-dimensional space.
  • A lower triangular mask (tril) is generated using tf.linalg.band_part. This mask is essential for preventing the model from attending to future tokens during training, thereby avoiding information leakage. The lower triangular mask ensures that each position in the input sequence can only attend to itself and to the positions/words preceding it. While training transformers, we pass the whole input sequence at once. So suppose we have the following input sequence:

[<start>, I, love, natural, language, processing, <end>]

Now here we want to predict the word after “natural”. The lower triangular mask ensures that during training, our model can only attend to the tokens up to and including “natural” (i.e., <start>, ‘I’, ‘love’, ‘natural’), masking out the words that come after it. This prevents the model from accessing future tokens, preserving the autoregressive nature of the task and ensuring that predictions are based solely on preceding context. It is only used in the decoder block and not in the encoder block: while encoding we can access all the words, but while decoding we cannot, because our task is to predict the next word.

  • In the end, we use a dropout layer initialized using tf.keras.layers.Dropout. Dropout regularization is applied to the attention weights during training to prevent overfitting and improve generalization performance.
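To see what the lower triangular mask looks like for the example sequence, here is a plain-Python sketch. It builds the same pattern of ones and zeros that tf.linalg.band_part(..., -1, 0) produces from a matrix of ones; the helper visible_tokens is hypothetical.

```python
tokens = ["<start>", "I", "love", "natural", "language", "processing", "<end>"]
T = len(tokens)

# Lower triangular mask: row i has 1s in columns 0..i and 0s afterwards.
mask = [[1 if col <= row else 0 for col in range(T)] for row in range(T)]

def visible_tokens(position):
    """Tokens that the given position is allowed to attend to."""
    return [tokens[col] for col in range(T) if mask[position][col] == 1]

for row in mask:
    print(row)
print(visible_tokens(tokens.index("natural")))
```

The row for “natural” has ones for <start>, “I”, “love” and “natural” itself, and zeros for everything that comes later.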

Now, we will code the attention mechanism.

    def call(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)   # (B, T, hs)
        q = self.query(x) # (B, T, hs)

        # compute attention scores ("affinities")
        wei = tf.matmul(q, tf.transpose(k, perm=[0, 2, 1])) * tf.math.rsqrt(tf.cast(k.shape[-1], tf.float32))  # (B, T, T)
        wei = tf.where(self.tril[:T, :T] == 0, float('-inf'), wei)  # (B, T, T)
        wei = tf.nn.softmax(wei, axis=-1)  # (B, T, T)
        wei = self.dropout(wei)

        # perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, hs)
        out = tf.matmul(wei, v)  # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
  • The method receives input x, which is a tensor representing the input sequence. It assumes that the input has three dimensions: (batch_size, time_steps, channels).
  1. Batch Size: It’s the number of sequences processed together. For instance, if we process 32 movie reviews simultaneously, the batch size is 32.
  2. Time Steps: It’s the length of each sequence. In a movie review consisting of 100 words, each word is a time step.
  3. Channels: It’s the dimensionality of each feature in a sequence. If we represent words with 300-dimensional embeddings, each word has 300 channels.
  • Then it applies the key and query layers to the input tensor x, resulting in tensors k and q, both with shapes (batch_size, time_steps, head_size). Here head_size refers to the dimensionality of the feature space within each attention head. For example, if head_size is set to 64, it means that each attention head operates within a feature space of dimension 64.
  • It computes attention scores between the query and key tensors (q, k) using the dot product, scaled by 1/√head_size (the tf.math.rsqrt factor) so that the scores stay in a moderate range. The result is a tensor wei of shape (batch_size, time_steps, time_steps), where each element represents the attention score between a query and a key.
  • The lower triangular mask is applied to wei to prevent attending to future tokens, ensuring the autoregressive property of the model.
  • The softmax function is then applied along the last dimension to obtain attention weights, ensuring that the weights sum up to 1 for each time step.
  • After that, Dropout regularization is applied to the attention weights to prevent overfitting during training.
  • Then it applies the value layer to the input tensor x, resulting in a tensor v of shape (batch_size, time_steps, head_size). It performs a weighted sum of the value tensor v using the attention weights wei, resulting in the output tensor out of shape (batch_size, time_steps, head_size). This step computes the context vector, which represents the contextually enriched representation of the input sequence based on attention computations.

The Head block we implemented represents a single attention head within the Transformer architecture. It performs attention computations, including key, query, and value projections, attention score calculation, masking, softmax normalization, and weighted aggregation of values. Each Head block focuses on capturing specific patterns and relationships within the input sequence, contributing to the overall representation learning process of the model.
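For readers who want to trace the computation by hand, here is a plain-Python sketch of the same steps for a single sequence (no batch dimension, no dropout, no learned projections). It is an illustration of scaled, masked attention under those simplifications, not a replacement for the Head layer.

```python
import math

def softmax(row):
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask for one sequence.

    q, k, v are lists of T vectors, each of length head_size.
    """
    T, head_size = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_size)
    out = []
    for i in range(T):
        # Scores against positions 0..i only. Skipping later positions has the
        # same effect as setting their scores to -inf before the softmax.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(head_size)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)  # q = k = v = x for simplicity
print(out[0])  # the first token can only see itself: [1.0, 0.0]
```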

Now, let’s delve into the concept of multi-head attention.

Multi-Head attention Block

Multi-head attention is a key component of the Transformer architecture designed to enhance the model’s ability to capture diverse patterns and dependencies within the input sequence. Instead of relying on a single attention head, the model utilizes multiple attention heads in parallel. Each attention head learns different patterns and relationships within the input sequence independently. The outputs of the multiple attention heads are then concatenated and passed through a projection layer to produce a comprehensive representation of the input sequence.

Why multi-head attention?

  • Capturing Diverse Patterns: Each attention head specializes in capturing specific patterns or dependencies within the input sequence, enhancing the model’s capacity to learn diverse relationships.
  • Improved Representation Learning: By leveraging multiple attention heads, the model can capture complex and nuanced interactions within the data, leading to more expressive representations.
  • Enhanced Robustness: Multi-head attention enables the model to learn from different perspectives simultaneously, making it more robust to variations and uncertainties in the input data.

Now, let’s code the multi-head attention.

class MultiHeadAttention(tf.keras.layers.Layer):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super(MultiHeadAttention, self).__init__()
        self.heads = [Head(head_size) for _ in range(num_heads)]
        self.proj = tf.keras.layers.Dense(n_embd)
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, x):
        out = tf.concat([h(x) for h in self.heads], axis=-1)
        out = self.dropout(self.proj(out))
        return out
  • num_heads and head_size are parameters passed to initialize the MultiHeadAttention layer. num_heads specifies the number of attention heads to be used in parallel and head_size determines the dimensionality of the feature space within each attention head.
__init__ Method:
  • In the __init__ method, we initialize the multiple attention heads by creating a list comprehension of Head instances. Each Head instance represents a single attention head with the specified head_size.
  • Additionally, we initialize a projection layer (self.proj) to aggregate the outputs of the multiple attention heads into a single representation.
  • A dropout layer (self.dropout) is also initialized to prevent overfitting during training.
call Method:
  • The call method takes the input tensor x and processes it through each attention head in parallel.
  • For each attention head in self.heads, the input tensor x is passed through the attention head, and the outputs are concatenated along the last axis using tf.concat.
  • The concatenated output is then passed through the projection layer self.proj to combine the information from multiple heads into a single representation.
  • Finally, dropout regularization is applied to the projected output to prevent overfitting.

In summary, the MultiHeadAttention class encapsulates the functionality of performing self-attention across multiple heads in parallel, enabling the model to capture diverse patterns and relationships within the input sequence. It forms a critical building block of the Transformer architecture, contributing to its effectiveness in various natural language processing tasks.
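The shape bookkeeping behind the concatenation can be sketched without TensorFlow. The values n_embd = 8 and num_heads = 2 are assumptions, toy_head is a hypothetical stand-in that only mimics the output shape of a head, and the projection and dropout steps are left out.

```python
n_embd = 8
num_heads = 2
head_size = n_embd // num_heads  # 4 features per head

def toy_head(x, head_index):
    """Stand-in for one attention head: returns head_size values per token."""
    return [[float(head_index)] * head_size for _ in x]

def concat_heads(x):
    head_outputs = [toy_head(x, h) for h in range(num_heads)]
    # Concatenate along the feature axis: per token, join the heads' vectors.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]

x = [[0.0] * n_embd for _ in range(3)]  # 3 tokens, n_embd features each
out = concat_heads(x)
print(len(out), len(out[0]), out[0])
```

With head_size = n_embd // num_heads, the concatenated output has n_embd features per token again, which is why the heads can be combined without changing the overall shape.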

Feed-forward layer

The FeedForward layer in the Transformer architecture introduces non-linearity and feature transformation, essential for capturing complex patterns in the data. Through the ReLU activation function, it models non-linearities, aiding better representation learning. By projecting input features into higher-dimensional spaces and reducing dimensionality, it enhances the model’s ability to capture intricate dependencies and structures, fostering more expressive representations. Additionally, dropout regularization within the layer prevents overfitting by encouraging robust and generalizable representations, improving the model’s performance across diverse natural language processing tasks.

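class FeedForward(tf.keras.layers.Layer):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super(FeedForward, self).__init__()
        # NOTE: the attribute name `net` is assumed; it is missing from the source listing
        self.net = tf.keras.Sequential([
            tf.keras.layers.Dense(4 * n_embd),  # expand to 4x the embedding size
            tf.keras.layers.ReLU(),             # non-linearity
            tf.keras.layers.Dense(n_embd),      # project back to n_embd
            tf.keras.layers.Dropout(dropout),   # regularization
        ])

    def call(self, x):
        return self.net(x)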
  • The FeedForward layer is initialized with the parameter n_embd, which specifies the dimensionality of the input and output feature spaces, that is, the size of the last dimension of the input and output tensors.
__init__ Method:
  • In the __init__ method, we define a simple feedforward neural network using tf.keras.Sequential.
  • The network consists of two dense layers:
    1. The first dense layer (tf.keras.layers.Dense(4 * n_embd)) projects the input features into a higher-dimensional space, followed by a rectified linear unit (ReLU) activation function (tf.keras.layers.ReLU()).
    2. The second dense layer (tf.keras.layers.Dense(n_embd)) reduces the dimensionality back to the original feature space.
  • Additionally, dropout regularization is applied using tf.keras.layers.Dropout(dropout) to prevent overfitting during training.
call Method:
  • The call method takes the input tensor x and passes it through the feedforward neural network defined in the __init__ method. The output of the feedforward network is returned as the final result.

In summary, the FeedForward class implements a feedforward neural network layer within the Transformer architecture. It applies linear transformations followed by non-linear activations to process input features, enabling the model to capture complex patterns and relationships within the data. This layer contributes to the expressive power and effectiveness of the Transformer model in various natural language processing tasks.
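The expand, ReLU, project-back pattern can be traced with plain Python on a single token. The hand-picked weights below are hypothetical (a real layer learns them), and dropout is left out.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(x, weights, bias):
    """Dense layer: weights has one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def feed_forward(x, w1, b1, w2, b2):
    hidden = relu(linear(x, w1, b1))  # expand to 4 * n_embd and apply ReLU
    return linear(hidden, w2, b2)     # project back down to n_embd

n_embd = 2
# Hand-picked weights (hypothetical): 4 * n_embd = 8 hidden units.
w1 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0],
      [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
b1 = [0.0] * (4 * n_embd)
w2 = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
b2 = [0.0] * n_embd

result = feed_forward([0.5, -2.0], w1, b1, w2, b2)
print(result)  # [0.5, 0.0]
```

The negative input feature is cut off by the ReLU, which is the non-linearity that a stack of purely linear layers could not provide.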

Transformer Block

Now, let’s add all these components to form a transformer block

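class Block(tf.keras.layers.Layer):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        super(Block, self).__init__()
        head_size = n_embd // n_head
        # NOTE: the attribute name `sa` (self-attention) is assumed; it is missing from the source listing
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.ln2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x):
        x = x + self.sa(self.ln1(x))    # communication: normalize, attend, add residual
        x = x + self.ffwd(self.ln2(x))  # computation: normalize, feed forward, add residual
        return x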
  • The Block class is initialized with two parameters: n_embd and n_head. n_embd specifies the dimensionality of the input and output feature spaces and n_head determines the number of attention heads to be used in the MultiHeadAttention layer.
  • Inside the __init__ method, we initialize the components of the Transformer block: a MultiHeadAttention layer, a FeedForward layer (self.ffwd), and two Layer Normalization layers (self.ln1, self.ln2), the last represented by Add & Norm in the above diagram.
  • The call method processes the input tensor x in two steps, each wrapped in a residual connection. First, x is normalized by self.ln1 and passed through the MultiHeadAttention layer, which lets different positions in the sequence communicate; the result is added back to x. Second, the updated tensor is normalized by self.ln2 and passed through the FeedForward layer (self.ffwd), and that result is again added back. Note that the code applies normalization before each sublayer (the pre-norm arrangement), not after it.
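
The residual wiring in call, x + sublayer(layer_norm(x)), can be sketched without TensorFlow. The version below is a simplified illustration on a single vector: layer_norm has no learned scale or shift, and the sublayers are hypothetical stand-in functions rather than real attention or feedforward layers.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Pre-norm residual step: x + sublayer(layer_norm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def block(x, attention, ffwd):
    x = residual(x, attention)   # communication
    x = residual(x, ffwd)        # computation
    return x

# Hypothetical stand-in sublayers, used only to show the wiring.
def zero(v):
    return [0.0] * len(v)

def half(v):
    return [0.5 * a for a in v]

x = [1.0, 2.0, 3.0, 4.0]
unchanged = block(x, zero, zero)   # sublayers add nothing, so x passes straight through
shifted = block(x, half, zero)
print(unchanged)
print(shifted)
```

Because each sublayer only adds to x, a block whose sublayers output zeros behaves as the identity, which is one reason deep stacks of such blocks remain trainable.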


Now that we have designed the components of GPT, let’s stack them together to build our GPT.

class GPTLanguageModel(tf.keras.Model):

    def __init__(self):
        super(GPTLanguageModel, self).__init__()
        # token embeddings: a lookup table mapping each token id to an n_embd-dimensional vector
        self.token_embedding_table = tf.keras.layers.Embedding(vocab_size, n_embd)
        self.position_embedding_table = tf.keras.layers.Embedding(block_size, n_embd)
        self.blocks = tf.keras.Sequential([Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.lm_head = tf.keras.layers.Dense(vocab_size, kernel_initializer='normal', bias_initializer='zeros')

The GPTLanguageModel class defines a language model based on the Generative Pre-trained Transformer (GPT) architecture.

__init__ Method:
  • The __init__ method initializes the components necessary for the GPT language model.
  • self.token_embedding_table: This layer converts input tokens into dense vectors of fixed size (embedding vectors). Each token is mapped to a unique embedding vector in a lookup table.
  • self.position_embedding_table: This layer is a second, learned lookup table with one embedding vector per position (0 to block_size - 1). It represents the position of each token in the input sequence.
  • self.blocks: A sequence of Transformer blocks responsible for processing the input sequence. Each block comprises multi-head self-attention mechanisms and feedforward neural networks.
  • self.ln_f: Applies layer normalization to the final hidden states of the Transformer blocks. It stabilizes the training process by ensuring consistent distributions of hidden states across layers.
  • self.lm_head: A dense layer that maps the final hidden states of the Transformer blocks to logits over the vocabulary. Logits are unnormalized scores, one per vocabulary token; applying softmax to them gives the probability of each token being the next token in the sequence.
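
As a rough sense of scale, the parameters in the layers listed above (everything outside the Transformer blocks) can be counted by hand. This sketch uses n_embd and block_size from the hyperparameters given later in this post; vocab_size = 65 is an assumption, since the real value depends on the corpus.

```python
vocab_size = 65      # ASSUMPTION: depends on the corpus; not stated in the post
block_size = 256     # from the post's hyperparameters
n_embd = 384         # from the post's hyperparameters

param_counts = {
    "token_embedding_table": vocab_size * n_embd,
    "position_embedding_table": block_size * n_embd,
    "ln_f (scale + shift)": 2 * n_embd,
    "lm_head (weights + bias)": n_embd * vocab_size + vocab_size,
}

for name, count in param_counts.items():
    print(f"{name:28s}{count:>8,d}")

total_outside_blocks = sum(param_counts.values())
print("total outside the blocks:", total_outside_blocks)
```

The Transformer blocks themselves are not counted here, because their size depends on how the attention layers are implemented.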

Let’s see these components in the transformer architecture.

Note: self.ln_f is not explicitly shown in the image.

Now let’s write the method which will perform the forward pass during our training phase.

call Method:
def call(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(tf.range(T))  # (T,C); position indices are integers
        x = tok_emb + pos_emb  # (B,T,C)
        x = self.blocks(x)  # (B,T,C)
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = tf.reshape(logits, (B * T, C))
            targets = tf.reshape(targets, (B * T,))
            loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(targets, logits)

        return logits, loss
  • The call method takes idx and targets as input.
    • idx represents the input tensor containing integer indices of tokens. It has shape (batch_size, sequence_length).
    • targets represents the target tensor containing the indices of the tokens to be predicted. It has the same shape as idx.
  • tok_emb retrieves the token embeddings for the input indices from the token embedding table.
  • pos_emb generates position embeddings for each position in the input sequence using the position embedding table.
  • x = tok_emb + pos_emb: The token and position embeddings are added together to incorporate both token and positional information into the input representation x.
  • x = self.blocks(x): Then the input representation x is passed through the Transformer blocks (self.blocks), which process the sequence and extract relevant features.
  • x = self.ln_f(x): Layer normalization (self.ln_f) is applied to stabilize the training process by normalizing the hidden states of the Transformer blocks.
  • logits = self.lm_head(x): The final hidden states are passed through the output layer (self.lm_head), which generates logits for each token in the vocabulary.
  • If targets are provided, the method computes the loss using the sparse categorical cross-entropy loss function. It reshapes the logits and targets tensors to match the format required by the loss function.
  • If targets are not provided, the loss is set to None. That means we are not training the model but using it for prediction/text generation.
  • The method returns the logits and the computed loss (if applicable).
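
The reshape-then-loss step can be reproduced in plain Python. The sketch below is a hand-written stand-in for sparse categorical cross-entropy with from_logits=True, not the Keras implementation; the function name and the tiny shapes are chosen for illustration.

```python
import math

def cross_entropy_from_logits(logits, targets):
    """Mean negative log-likelihood; logits is (N, C) as nested lists, targets is (N,)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)                                   # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]                   # -log softmax(row)[target]
    return total / len(targets)

B, T, C = 2, 3, 5
logits_btc = [[[0.0] * C for _ in range(T)] for _ in range(B)]   # uniform predictions
targets_bt = [[1, 4, 0], [2, 2, 3]]

# Flatten (B, T, C) -> (B*T, C) and (B, T) -> (B*T,), mirroring the tf.reshape calls.
flat_logits = [row for seq in logits_btc for row in seq]
flat_targets = [t for seq in targets_bt for t in seq]

loss = cross_entropy_from_logits(flat_logits, flat_targets)
print(round(loss, 4), round(math.log(C), 4))
```

With uniform logits the loss equals ln(C), so a model that has learned nothing yet should start near the log of the vocabulary size. Assuming a vocabulary of roughly 65 characters, that is about 4.17.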

Now that we’ve explored the inner workings of the call method, let’s dive into another captivating feature of our Generative Pre-trained Transformer (GPT): the generate method. While the call method focuses on predicting the next character given a sequence, the generate method takes it a step further by generating entire sequences of text. It relies on the call method internally to predict each subsequent character, iteratively building the complete sequence.

generate Method:
def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = tf.nn.softmax(logits, axis=-1)  # (B, C)
            # sample from the distribution
            idx_next = tf.random.categorical(tf.math.log(probs), num_samples=1, dtype=tf.int64)  # (B, 1)
            # append sampled index to the running sequence
            idx = tf.concat([idx, idx_next], axis=1)  # (B, T+1)
        return idx
  • for _ in range(max_new_tokens): The method iterates for a specified number of max_new_tokens to generate new tokens based on the provided input sequence idx. max_new_tokens tells us about the number of tokens we want our GPT to generate.
  • idx_cond = idx[:, -block_size:]: It then keeps only the last block_size tokens of idx. The position embedding table has just block_size entries, so the model cannot be given a longer context than that.
  • logits, loss = self(idx_cond): Then the method invokes the model’s call method with the cropped sequence idx_cond. No targets are passed, so loss is None; logits holds an unnormalized score for every token in the vocabulary at every position.
  • logits = logits[:, -1, :]: It selects only the logits at the last time step of the sequence, which are the predictions for the next token to be generated.
  • probs = tf.nn.softmax(logits, axis=-1): Softmax converts the logits into a probability distribution over the vocabulary, giving the likelihood of each token being the next token in the sequence.
  • idx_next = tf.random.categorical(tf.math.log(probs), num_samples=1, dtype=tf.int64): It samples one token index for each sequence in the batch. tf.random.categorical expects unnormalized log-probabilities (logits) rather than probabilities, which is why probs is passed through tf.math.log; passing the last-step logits directly would define the same distribution.
  • idx = tf.concat([idx, idx_next], axis=1): Then the sampled token indices are appended to the original input sequence idx, extending the sequence with the newly generated tokens.
  • This process repeats for each iteration of the loop, generating new tokens until the desired number of tokens (max_new_tokens) is reached.
  • Finally, the method returns the updated input sequence idx, which now includes the newly generated tokens, representing an extended sequence with additional context and predictions for future tokens.
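
The loop described above can be sketched in plain Python with a toy stand-in for the network. Here toy_model is hypothetical and simply favours the token that follows the last one; a real model would return learned logits. Sampling uses random.choices on probabilities instead of tf.random.categorical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(context, vocab_size):
    """Stand-in for the trained network: favours the token after the last one."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 3.0
    return logits

def generate(context, max_new_tokens, block_size, vocab_size, rng):
    idx = list(context)
    for _ in range(max_new_tokens):
        idx_cond = idx[-block_size:]                   # crop to the last block_size tokens
        probs = softmax(toy_model(idx_cond, vocab_size))
        idx_next = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        idx.append(idx_next)                           # extend the running sequence
    return idx

rng = random.Random(0)
sample = generate([0], max_new_tokens=10, block_size=4, vocab_size=5, rng=rng)
print(sample)
```

Because the next token is sampled rather than chosen greedily, repeated runs with different seeds produce different sequences from the same starting context.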

In summary, the Generative Pre-trained Transformer (GPT) architecture employs advanced techniques like multi-head self-attention, feedforward neural networks, and layer normalization to understand and generate natural language text. With token and position embedding tables and a stack of Transformer blocks, GPT captures complex language patterns effectively.

Now, it’s time to train the GPT model on relevant datasets, fine-tune its parameters, and explore its capabilities across different tasks and domains. We’ll use the Shakespeare dataset to train our GPT. This means our model will learn to generate text in the style of Shakespeare’s writings. You can find the dataset at

Let’s start with loading the dataset:

Loading the data

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

Now, let’s create the character mappings so that we can convert characters into numbers to feed to the model.

chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

The above code initializes dictionaries for character-to-index and index-to-character mappings:

  • It extracts the unique characters from the text and sorts them by Unicode code point, so, for example, uppercase letters come before lowercase ones.
  • Two dictionaries are created:
    • stoi: Maps characters to indices.
    • itos: Maps indices to characters.
  • Encoding (encode) and decoding (decode) functions are defined to convert between strings and lists of indices.
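
A quick round trip on a short string shows what these mappings produce. The sample text below is a stand-in for the corpus, and encode and decode are written as regular functions instead of lambdas; otherwise the logic is the same.

```python
text = "to be, or not to be"          # tiny stand-in for the Shakespeare corpus

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """String -> list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """List of integer ids -> string."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(chars)        # [' ', ',', 'b', 'e', 'n', 'o', 'r', 't']
print(ids)          # [7, 5, 0, 2, 3]
print(decode(ids))  # to be
```

Note that encode raises a KeyError for any character that does not appear in the text used to build the vocabulary.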

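Now let’s divide our dataset into training and validation sets: the first 90% of the encoded text is used for training and the remaining 10% for validation.

# Train and validation splits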
data = tf.constant(encode(text), dtype=tf.int64)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

To streamline our data processing, we’ll break it down into manageable batches. This approach helps us efficiently handle large datasets without overwhelming our system resources. Let’s write a function to load our data in batches, enabling us to feed it into our model systematically and effectively.

# Data loading into batches
def get_batch(split):
    data_split = train_data if split == 'train' else val_data
    ix = tf.random.uniform(shape=(batch_size,), maxval=len(data_split) - block_size, dtype=tf.int32)
    x = tf.stack([data_split[i:i+block_size] for i in ix])
    y = tf.stack([data_split[i+1:i+block_size+1] for i in ix])
    return x, y
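
The key property of get_batch is that each target row is the input row shifted one position to the right. The following pure-Python sketch shows this on a small list of consecutive integers; it takes the data and sizes as explicit arguments (a deviation from the function above, made so the example is self-contained).

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Sample batch_size windows; each target row is its input row shifted by one."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

data = list(range(100, 120))   # stand-in for the encoded text
xb, yb = get_batch(data, batch_size=2, block_size=4, rng=random.Random(0))
for x_row, y_row in zip(xb, yb):
    print(x_row, "->", y_row)
```

At every position t, the model sees x_row[:t + 1] and is trained to predict y_row[t], so a single window of block_size tokens yields block_size training examples.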

Now that we have our dataset ready, we need a function to calculate the loss. This loss function helps us understand how well our model is performing during training. By evaluating the loss, we can adjust our model’s weights using the backpropagation algorithm, which fine-tunes its parameters to minimize the loss and improve performance over time. Let’s craft a simple yet effective function to calculate the loss for our model.

Calculating Loss

# Calculating loss of the model
def estimate_loss(model):
    out = {}
    model.trainable = False
    for split in ['train', 'val']:
        losses = tf.TensorArray(tf.float32, size=eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses = losses.write(k, loss)
        out[split] = losses.stack().numpy().mean()
    model.trainable = True
    return out
  • The function starts by initializing an empty dictionary named out to store the loss values for both the training and validation splits.
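  • It sets the trainable attribute of the model to False while the loss is being estimated. No optimizer step runs inside this function, so the weights would not change either way; the flag acts only as a safeguard. Note that trainable does not control dropout, which is governed by the training argument of the call.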
  • The function iterates over two splits: ‘train’ and ‘val’, representing the training and validation datasets, respectively.
  • Within each split, the function iterates eval_iters times. In each iteration, it retrieves a batch of input-output pairs (X, Y) using the get_batch(split) function.
  • For each batch, the model is called with inputs X and targets Y to obtain the logits and the corresponding loss.
  • The loss value for each iteration is stored in a TensorFlow TensorArray named losses.
  • Once all iterations for a split are completed, the mean loss value across all iterations is computed using the numpy().mean() method, and it is stored in the out dictionary with the corresponding split key.
  • After iterating over both ‘train’ and ‘val’ splits, the model’s trainable attribute is set back to True to allow further training if needed.
  • Finally, the function returns the dictionary out, containing the average loss values for both the training and validation splits.

Training the model

Now, let’s define the hyperparameters needed to configure our model training.

# hyperparameters
batch_size = 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

# Set random seed
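
A few quantities follow directly from these hyperparameters and are worth computing before training. The sketch below uses only the values above, plus one clearly labeled assumption about the size of the training split.

```python
batch_size, block_size, max_iters = 64, 256, 5000
n_embd, n_head = 384, 6

head_size = n_embd // n_head                 # width of each attention head
tokens_per_step = batch_size * block_size    # tokens processed per training step
tokens_seen = tokens_per_step * max_iters    # tokens processed over the whole run

# ASSUMPTION: roughly one million characters in the training split.
# Replace with len(train_data) for the real figure.
assumed_train_chars = 1_000_000
passes = tokens_seen / assumed_train_chars

print(head_size, tokens_per_step, tokens_seen)
print(f"approximate passes over the training data: {passes:.2f}")
```

Under that assumption the model sees each training character dozens of times, which is worth keeping in mind when reading the training log.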

Now, let’s implement the training loop for our model. In each iteration, the loop samples a random batch from the training data, runs it through the model to calculate the loss, computes the gradients of that loss by backpropagation, and lets the optimizer update the weights. By repeating this process over many iterations, our model gradually learns to make more accurate predictions. Let’s dive in and code the training loop for our model.

#Training the model. GPU is recommended for training.

model = GPTLanguageModel()
optimizer = tf.keras.optimizers.Adam(learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss(model)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    with tf.GradientTape() as tape:
        logits, loss = model(xb, yb)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
  • The GPTLanguageModel class is instantiated, creating an instance of the GPT language model.
  • Then an Adam optimizer is initialized with the specified learning rate (learning_rate).
  • The training loop runs for a specified number of iterations (max_iters). Every eval_interval iterations, and on the final iteration, the model’s loss is estimated on both the training and validation datasets and printed.
  • In each iteration, a batch of data (xb, yb) is sampled from the training dataset using the get_batch function. This function retrieves input-output pairs for training.
  • The loss is computed by forward-passing the input batch (xb) through the model (model) and comparing the predictions with the actual targets (yb).
  • A gradient tape (tf.GradientTape) records operations for automatic differentiation, enabling the computation of gradients with respect to trainable variables.
  • Gradients of the loss with respect to the trainable variables are computed using tape.gradient.
  • The optimizer (optimizer) then applies these gradients to update the model’s trainable parameters using the Adam optimization algorithm.
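
To see what an Adam update does with a gradient, here is a scalar sketch of the update rule minimizing a simple quadratic. It is a textbook-style illustration with a hypothetical function name and a larger learning rate than the one used in this post, not the Keras implementation.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update rule."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

In the training loop, tape.gradient plays the role of grad_fn, and the optimizer applies an update of this form to every trainable variable.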

With the completion of the training loop, our model has been trained with the Adam optimizer, a gradient-based optimization method. Through repeated parameter updates, it has learned to reduce the loss on the training data, improving its ability to predict the next character and therefore to generate text that resembles its training corpus.

step 0: train loss 4.5158, val loss 4.5177
step 500: train loss 1.9006, val loss 2.0083
step 1000: train loss 1.4417, val loss 1.6584
step 1500: train loss 1.2854, val loss 1.5992
step 2000: train loss 1.1676, val loss 1.5936
step 2500: train loss 1.0419, val loss 1.6674
step 3000: train loss 0.9076, val loss 1.8094
step 3500: train loss 0.7525, val loss 2.0218
step 4000: train loss 0.6012, val loss 2.3162
step 4500: train loss 0.4598, val loss 2.6565
step 4999: train loss 0.3497, val loss 2.9876

From the provided training log, we can observe several key insights:

  1. Training Progress: Over the first 2,000 steps, both losses fall steadily: the training loss drops from 4.5158 to 1.1676 and the validation loss from 4.5177 to 1.5936. This indicates that the model is learning patterns that carry over to unseen text.
  2. Overfitting: After step 2,000 the two curves diverge. The training loss keeps decreasing, reaching 0.3497, while the validation loss rises from 1.5936 to 2.9876. This divergence indicates that the model is overfitting: it fits the training data increasingly well but generalizes worse to the unseen data in the validation set.
  3. Model Performance: The validation loss provides insight into how the model performs on unseen data, and lower is better. Here the final value of 2.9876 is well above the best value of 1.5936 seen at step 2,000, so the weights at the end of training are not the best ones this run produced.
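
The same reading can be done programmatically. This short sketch parses the log lines shown above and reports the step with the lowest validation loss, which is the point where an early-stopping rule would have kept the weights.

```python
import re

log = """\
step 0: train loss 4.5158, val loss 4.5177
step 500: train loss 1.9006, val loss 2.0083
step 1000: train loss 1.4417, val loss 1.6584
step 1500: train loss 1.2854, val loss 1.5992
step 2000: train loss 1.1676, val loss 1.5936
step 2500: train loss 1.0419, val loss 1.6674
step 3000: train loss 0.9076, val loss 1.8094
step 3500: train loss 0.7525, val loss 2.0218
step 4000: train loss 0.6012, val loss 2.3162
step 4500: train loss 0.4598, val loss 2.6565
step 4999: train loss 0.3497, val loss 2.9876
"""

pattern = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")
rows = [(int(s), float(tr), float(va)) for s, tr, va in pattern.findall(log)]

best_step, best_train, best_val = min(rows, key=lambda r: r[2])
final_step, final_train, final_val = rows[-1]

print(f"best val loss {best_val} at step {best_step}")          # best val loss 1.5936 at step 2000
print(f"final gap (val - train): {final_val - final_train:.4f}")  # final gap (val - train): 2.6379
```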

Now, it’s important to note that the observed behavior in the training log, including the increasing validation loss towards the end of training, was intentionally introduced to highlight the phenomenon of overfitting. Overfitting occurs when a model learns to perform well on the training data but struggles to generalize to unseen data.

As part of your learning journey, it’s now your homework to address this issue and improve the model’s performance. You can explore various strategies to combat overfitting, such as adjusting the model architecture, incorporating regularization techniques, or increasing the diversity of the training data.

We have saved the weights of the model after 5,000 iterations. You can use them directly to skip the training phase, which can take a long time without a GPU. The weights are present at:

# Initializing model with pre-trained weights. Use this if you don't want to re-train the model.
model = GPTLanguageModel()
dummy_input = tf.constant([[0]], dtype=tf.int32)

Now we will generate new text using the model.

# generate from the model
context = tf.zeros((1, 1), dtype=tf.int64)
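generated_sequence = model.generate(context, max_new_tokens=500).numpy()
# Decode the first (and only) sequence in the batch back into text
print(decode(generated_sequence[0].tolist()))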
  • An initial context is set up using tf.zeros((1, 1), dtype=tf.int64). This initializes a tensor of shape (1, 1) with all elements set to zero, indicating the starting point for text generation.
  • The generate method of the trained model (model) is called to generate new text sequences based on the provided initial context. The max_new_tokens parameter specifies the maximum number of new tokens to generate in the text sequence.
  • The generated sequence is then decoded using a decoding function (decode) to convert the sequence of token IDs into human-readable text.

So, the output is:

Now keeps.
Can I know should thee were trans--I protest,
To betwixt the Samart's the mutine.

Ha, madam!
Sir, you!
You pitiff now, but you are worth aboards,
Betwixt the right of your ox adversaries,
Or let our suddenly in all severaltius free
Than Bolingbroke to England. Mercutio,
Ever justice with his praisence, he was proud;
When she departed by his fortune like a greer,
And in the gentle king fair hateful man.
Farewell; so old Cominius, away; I rather,
To you are therefore be behold

The generated text reproduces many surface features of Shakespeare’s writing: archaic vocabulary (thee, betwixt), character names (Bolingbroke, Mercutio, Cominius), verse-like line breaks, and short exchanges between speakers. It also contains invented words (pitiff, praisence, severaltius) and sentences that do not quite parse, which is expected from a small character-level model.

Overall, the sample shows that the model has learned the spelling, rhythm, and tone of its training data well enough to produce plausible-looking text, even though the output is not yet coherent at the level of meaning.

You can save the model weights using:


In conclusion, we have delved into the architecture and training process of the Generative Pre-trained Transformer (GPT) model. We explored the intricacies of its components, and gained insights into its training dynamics. Through our journey, we identified challenges such as overfitting and discussed strategies to address them.

As we conclude, it’s important to remember that mastering machine learning models like GPT requires a combination of theoretical understanding, practical experimentation, and iterative refinement. By diving into the code, dataset, and pre-trained weights that accompany this post, you can further explore, experiment, and enhance your understanding of GPT and its applications. Embrace the learning process, and let curiosity guide you as you continue your exploration of the fascinating world of natural language processing and machine learning.