Understanding Baby Llama2 Training – A Visual Design Walkthrough

In this post we explore how to train the Llama2 model using the Baby Llama2 created by Andrej Karpathy [1], which is based on his original minGPT model [2] and has the same basic transformer architecture used by other generative AI models, such as ChatGPT [3]. Generative transformer models employ a stack of transformer blocks whose weights store the knowledge learned via the attention mechanism described by Vaswani et al. [4]. In general, transformer models have the following architecture.

Transformer Model

The transformer model is trained to understand natural language input in the form of text that is first tokenized by converting the text to numeric values that are references into a vocabulary dictionary. The tokenized values are used to train the network where the previous tokens predict the next token in the sequence. During the training process, the model first (1) creates an embedding of each token, (2) adds dropout to help the model generalize, and then (3) runs through the stack of 6 to 100+ transformer blocks, each of which contains the attention mechanism used to learn from the inputs. The transformer block results are (4) normalized to keep the model stable, and (5) converted into logits that correspond to each of the vocabulary items, which in this case comprise 32,000 vocabulary items. Finally, during training, (6) the logits, which encode the predicted ‘next’ token at each position, are compared with the actual ‘next’ token values using the CrossEntropyLoss to produce the loss value. The loss value is then backpropagated through the network to calculate the gradients, which the optimizer applies to the weights, thus facilitating the actual learning.

In this post, we will dive deep into the actual data pre-processing, the training process and model data flow occurring during the training.


When using Karpathy’s Baby Llama2, the pre-processing occurs in two steps. First the Tiny Stories dataset is downloaded from Hugging Face and extracted. Then the extracted data is converted to tokens in the pre-tokenizing process.

Downloading The Dataset

The Tiny Stories dataset by Eldan et al. [5] is a collection of short, simple stories created using other versions of GPT, including both GPT-3.5 and GPT-4. This dataset is used to train the Baby Llama model from scratch.

Downloading Tiny Stories Dataset

During the downloading process the following steps occur.

  1. The raw tar.gz data file is downloaded to the data directory relative to your Python project.
  2. Next, the 50 Json data files are extracted from the tar.gz file and placed in the data/TinyStories_all_data directory.

Json Data File Content

Each of the 50 Json files contains 100,000 short stories for a total of five million short stories in the entire dataset.

Dataset Json File Format

Within a single Json file (1) you will find the 100,000 short stories (2) each stored with the same format (3) containing the following entries as described by [5]:

Story: the story portion contains the actual short story text created using the source model, instructions, and summary.

Instruction: the instruction contains a prompt used to prepare the model, a list of words to be included in the story, and features describing the story’s attributes.

Summary: a summary of the story used to create the actual full story text along with the instructions.

Source: the source is the source model used to create and summarize the short story.

When training the Baby Llama2 model, only the story portion of the data is used.
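
As a rough illustration of keeping only the story text, here is a minimal sketch. It assumes each shard is a Json array of objects with a "story" key, as in the format described above. The load_stories function name and the tiny sample shard are hypothetical and exist only so the example runs without the real dataset.

```python
import json
import os
import tempfile


def load_stories(shard_path):
    """Return the stripped story text of every entry in one Json shard."""
    with open(shard_path, "r", encoding="utf-8") as f:
        entries = json.load(f)  # assumed: a list of dicts, one per story
    # Only the "story" field is used for training; the rest is ignored.
    return [entry["story"].strip() for entry in entries]


# Build a tiny stand-in shard so the sketch runs without the real dataset.
sample = [
    {"story": " Tom had a red ball. ", "instruction": {"words": ["ball"]},
     "summary": "Tom has a ball.", "source": "GPT-4"},
    {"story": "Lily saw a cat.", "instruction": {"words": ["cat"]},
     "summary": "Lily sees a cat.", "source": "GPT-3.5"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_shard.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    stories = load_stories(path)

print(stories)
```

The instruction, summary, and source entries are read from disk but never reach the tokenizer.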

Pre-Tokenizing The Data

Pre-tokenizing is the process of selecting and converting the raw text into tokens, where a token is a numeric index value that references a specific sub-text value within the vocabulary. For example, a vocabulary based on 26 characters would have twenty-six entries and therefore token values from 0 to 25. As you can imagine, large language model vocabularies are much larger because they contain not only the individual characters used but also numerous combinations of those characters. A common tokenization algorithm used to build a vocabulary is called byte-pair encoding [6] [7]. This type of tokenization method is used by the Llama models, including Baby Llama, which uses a vocabulary of 32,000 entries.
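
To make the idea of merging character combinations concrete, here is a toy character-level sketch of the byte-pair encoding merge loop. It is only an illustration of the general technique in [6], not the SentencePiece implementation; the helper names and the sample string are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "aaabdaaabac"
tokens = list(text)
vocab = sorted(set(tokens))          # start with the single characters
for _ in range(3):                   # perform three merges
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    vocab.append(pair[0] + pair[1])  # each merge adds one vocabulary entry

# A token is simply the index of a sub-text value in the vocabulary.
token_ids = [vocab.index(t) for t in tokens]
print(tokens, token_ids)
```

Each merge shortens the token sequence and grows the vocabulary by one entry, which is how a vocabulary ends up far larger than the raw character set.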

Pre-Tokenizing the Json Data

During tokenization, the following steps occur.

  1. For each Json file (e.g., json), …
  2. … the process_shard() function tokenizes each of the 100,000 short stories within the Json file.
  3. During tokenization, first the story text is extracted from the shard.
  4. Next, the story text is tokenized using the byte-pair tokenization process implemented by the SentencePieceProcessor which is part of SentencePiece.
  5. The tokens created from each shard are then added to the all_tokens list.
  6. Upon completing the tokenization of all shards, the all_tokens list is saved as a byte file to the corresponding data file. For example, the tokenized json shards are saved to the data49.bin binary data file.
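
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: toy_encode is a hypothetical byte-level stand-in for the SentencePiece tokenizer (so the example runs without a tokenizer model file), the BOS_ID marker value is assumed, and storing tokens as 16-bit unsigned integers is an assumption that works because 32,000 token values fit in 16 bits.

```python
import os
import tempfile

import numpy as np

BOS_ID = 1  # assumed beginning-of-story marker id for this sketch


def toy_encode(text):
    """Hypothetical stand-in tokenizer: one token id per byte, offset by 3."""
    return [b + 3 for b in text.encode("utf-8")]


def pretokenize(stories, bin_path):
    """Tokenize every story, concatenate the ids, and write them as uint16."""
    all_tokens = []
    for story in stories:
        all_tokens.append(BOS_ID)            # mark the start of each story
        all_tokens.extend(toy_encode(story.strip()))
    arr = np.array(all_tokens, dtype=np.uint16)
    with open(bin_path, "wb") as f:
        f.write(arr.tobytes())
    return arr


stories = ["Tom had a ball.", "Lily saw a cat."]
with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "shard.bin")
    written = pretokenize(stories, bin_path)
    # Read the file back the same way a training loader could.
    restored = np.fromfile(bin_path, dtype=np.uint16)

print(len(restored), restored[:8])
```

Because the file is a flat array of fixed-width integers, any position in it can later be read directly without parsing.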

Training Model from Scratch

When training from scratch, the model contains randomly initialized weights that have no learned knowledge. The process of training allows the model to learn the knowledge we seek, and in the case of GPT-type models such as Llama2, the model is trained to predict the next token given a set of input tokens. Keep in mind that the input tokens may include the system prompt, the user prompt, etc.

Training Process

During training, the pre-tokenized tokens are used to train the model with the following steps occurring.

  1. The PretokDataset loads each pre-tokenized *.bin data file into a memory-mapped file that is read only when needed and randomly selects the starting point ‘r’ from within the file data. A sequence of seq_len tokens is loaded from the ‘r’ starting point and stored as the current line in the batch being built. Each line of data is seq_len in length. To predict the next token, the Y tensors are offset one token forward in the sequence from that of the X input sequence.
  2. A grid of token sequences, each seq_len in length, is stacked to create the X batch and the Y batch, where each sequence in the Y batch is shifted one token forward relative to the corresponding X sequence of tokens.
  3. The X and Y batches of tokens are placed into Tensors of size (128,256) corresponding to a batch size of 128 and sequence length of 256.
  4. The evaluation loop then estimates the current training and testing losses. Both losses are computed by running X batches through the model and comparing the predictions with the matching Y batches; the training loss draws its batches from the training split and the testing loss from the held-out split. This loop runs for one hundred iterations per split and the losses calculated are averaged.
  5. The losses are printed for output and if the testing loss is better than the best previous testing loss, the current model check-point weights are saved to disk.
  6. Next, the training loop is entered to train the model. To simulate a larger batch size, the training loop runs four cycles and accumulates the gradients at each step. These gradients will later be used to alter the model weights, which is how the model learns.
  7. Next, the gradients are clipped so that their overall norm is no larger than 1.0 for model stability.
  8. Then the optimizer is used to update the model weights using the learning rate, momentum, weight decay and the optimizer algorithm used. Baby Llama uses the AdamW [8] optimizer.
  9. After updating the weights, the training loop continues and at certain iterations re-enters the evaluation loop, or otherwise continues to the next training loop.
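
Steps 1 through 3 above can be sketched in a few lines. This is a simplified illustration rather than the repository's loader: the sample_batch name is hypothetical, a plain NumPy array stands in for the memory-mapped *.bin file (np.memmap could be used for a real file), and the start positions are drawn uniformly at random.

```python
import numpy as np


def sample_batch(data, batch_size, seq_len, rng):
    """Build X and Y batches where Y is X shifted one token forward."""
    xs, ys = [], []
    for _ in range(batch_size):
        # Random start 'r'; leave room for seq_len tokens plus one target.
        r = int(rng.integers(0, len(data) - seq_len - 1))
        chunk = np.asarray(data[r : r + seq_len + 1], dtype=np.int64)
        xs.append(chunk[:-1])   # inputs: tokens r .. r+seq_len-1
        ys.append(chunk[1:])    # targets: tokens r+1 .. r+seq_len
    return np.stack(xs), np.stack(ys)


# Stand-in for a memory-mapped *.bin file of uint16 token ids.
data = np.arange(10_000, dtype=np.uint16)
rng = np.random.default_rng(0)
X, Y = sample_batch(data, batch_size=128, seq_len=256, rng=rng)
print(X.shape, Y.shape)
```

Every position in X therefore has its ‘next’ token sitting at the same position in Y, which is what lets one forward pass train on all 256 positions at once.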

Training continues until a sufficiently low and acceptable loss value is observed on the testing dataset. The testing loss should not diverge from the training loss, since a widening gap between the two usually indicates overtraining.
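
Looking back at steps 6 and 7 of the training loop, gradient accumulation and clipping can be sketched as below. This is a minimal NumPy illustration under assumptions: the micro-batch gradients are random stand-ins, clipping is done on the combined norm of all gradients (the behavior of PyTorch's clip_grad_norm_), and the final update is a plain gradient step rather than AdamW, just to keep the example short.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients together so their combined L2 norm <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


rng = np.random.default_rng(0)
accum_steps = 4                      # micro-batches per weight update
weights = [np.zeros((3, 3)), np.zeros(3)]
accumulated = [np.zeros_like(w) for w in weights]

for _ in range(accum_steps):
    # Hypothetical stand-in for the gradients of one micro-batch.
    micro_grads = [rng.normal(size=w.shape) for w in weights]
    for acc, g in zip(accumulated, micro_grads):
        acc += g / accum_steps       # average so the scale matches one big batch

clipped, norm_before = clip_by_global_norm(accumulated, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

learning_rate = 0.1                  # illustrative value only
# Plain gradient step; the real training uses the AdamW optimizer instead.
weights = [w - learning_rate * g for w, g in zip(weights, clipped)]
print(norm_before, norm_after)
```

Scaling every gradient by the same factor keeps the update direction unchanged while bounding its size.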

Upon completion of the training, the trained weights are the same weights used by the Llama2.c inferencing discussed in our previous blog.

Baby Llama Model

The Baby Llama model is a scaled down version of the same model architecture used by the full Llama2 7B and Llama2 13B models. The main difference between the two models is that the number of layers and model dimensions are much smaller in the Baby Llama model.

Baby Llama Model

When processing the Baby Llama Model, the following steps occur.

  1. First, the input batch of tokens in the X tensor is sent to the Embedding layer which converts each token into a 288-dimensional embedding. Embeddings are used to spatially separate each of the 32000 potential token values. The output of the embedding process is placed in the h tensor of shape (128, 256, 288) which corresponds to a batch size of 128, sequence length of 256 tokens, each with a dimension of 288 for the embedding. Note, the 7B Llama2 model uses an embedding size of 4096.
  2. Next, the h tensor is passed through a Dropout layer to help regularize the model.
  3. And then the h tensor is run through the stack of TransformerBlock layers that make up the foundation of the overall model. As you will see later, each TransformerBlock contains the attention mechanism discussed by Vaswani et al. [4], which is where the main knowledge learning of the model occurs. The main difference between the Baby Llama model and larger models such as the 7B Llama2 model is that Baby Llama uses 6 TransformerBlocks corresponding to its 6 layers, whereas the 7B Llama2 model uses 32 TransformerBlocks corresponding to its 32 layers. Also note that at this stage the freqs_cos and freqs_sin tensors are fed into each TransformerBlock. These two tensors are used to add positional encoding to the Q and K tensors during CausalSelfAttention (discussed later).
  4. After the last TransformerBlock the h tensor is normalized with the RMSNorm [9] layer for model stability.
  5. Then the h tensor is processed by a straight Linear layer which converts the h tensor into the logits of shape (128,256,32000), where each of the 32000 values is an unnormalized score for the corresponding vocabulary entry being the next token. A softmax over these scores yields the predicted probabilities.
  6. A CrossEntropyLoss [10] layer is used to calculate the loss, which then feeds into the backward pass to calculate the gradients that later update the weights to facilitate model learning.
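
The data flow above can be traced with a small NumPy sketch. To keep it light, tiny stand-in sizes replace the real ones (128, 256, 288, and 32000), the Dropout layer and the TransformerBlock stack are skipped, and the weights are random. All variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, dim, vocab = 2, 8, 16, 50   # tiny stand-ins for 128, 256, 288, 32000

tok_embeddings = rng.normal(scale=0.02, size=(vocab, dim))
output_weight = rng.normal(scale=0.02, size=(dim, vocab))

X = rng.integers(0, vocab, size=(batch, seq_len))   # input token ids
Y = rng.integers(0, vocab, size=(batch, seq_len))   # target 'next' token ids


def rms_norm(x, eps=1e-5):
    """Normalize the last axis by its root mean square (gain omitted)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


h = tok_embeddings[X]               # (batch, seq_len, dim) embedding lookup
# ... the stack of TransformerBlock layers would transform h here ...
h = rms_norm(h)                     # final normalization
logits = h @ output_weight          # (batch, seq_len, vocab) raw scores

# Cross-entropy: log-softmax over the vocabulary, then the negative log
# probability assigned to each correct next token, averaged.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
b_idx, t_idx = np.indices(Y.shape)
loss = -log_probs[b_idx, t_idx, Y].mean()
print(h.shape, logits.shape, float(loss))
```

With untrained weights the loss sits close to the natural log of the vocabulary size, which is a handy sanity check at the start of training.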

As you can see the TransformerBlock layer is one of the key layers in the model as it makes up the main area of learning in the Transformer model.

TransformerBlock Layer

The TransformerBlock layer processes each h tensor by applying normalization, attention, and the final feed forward processing. This layer is the backbone of every transformer-based model.

TransformerBlock Layer

When processing tensors, the TransformerBlock layer takes the following steps.

  1. First the h tensor of size (128,256,288) is passed to the transformer block which treats this input tensor as the x tensor.
  2. Next, the x tensor is normalized using the RMSNorm layer with a dimension of 288, meaning that the last axis of each tensor is normalized. According to [9] a Root Mean Square Layer Normalization (RMSNorm) is computationally more efficient than Layer Normalization in that it “regularizes the summed inputs to a neuron in one layer according to the root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaption ability.”
  3. In the next step, the CausalSelfAttention applies the time positional embedding using RoPE [11] and then applies self-attention as described by Vaswani et al. [4].
  4. The results from the CausalSelfAttention are then added to the h tensor as a residual.
  5. RMSNorm normalizes the h tensor for model stabilization…
  6. … and passes the results to the FeedForward final processing.

The FeedForward result is added to the h tensor as a residual and returned as the output of the TransformerBlock layer.
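
The normalize, process, and add-residual pattern described above can be sketched as follows. This is a minimal NumPy illustration, with simple matrix multiplications standing in for the real CausalSelfAttention and FeedForward layers; those stand-ins and the function names are assumptions made for the example.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm over the last axis, followed by a learned per-feature gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def transformer_block(x, attention, feed_forward, attn_gain, ffn_gain):
    """Pre-norm block: each sub-layer sees a normalized input and its
    output is added back to the un-normalized tensor as a residual."""
    h = x + attention(rms_norm(x, attn_gain))
    out = h + feed_forward(rms_norm(h, ffn_gain))
    return out


rng = np.random.default_rng(0)
dim = 288
x = rng.normal(size=(2, 4, dim))            # tiny batch and sequence length
w_attn = rng.normal(scale=0.02, size=(dim, dim))
w_ffn = rng.normal(scale=0.02, size=(dim, dim))

out = transformer_block(
    x,
    attention=lambda t: t @ w_attn,          # hypothetical stand-in sub-layer
    feed_forward=lambda t: t @ w_ffn,        # hypothetical stand-in sub-layer
    attn_gain=np.ones(dim),
    ffn_gain=np.ones(dim),
)
normed = rms_norm(x, np.ones(dim))
print(out.shape)
```

Because the residual path is never normalized, the block's output keeps the same shape as its input and blocks can be stacked freely.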

CausalSelfAttention Layer

The CausalSelfAttention layer applies attention to the input h tensor and during the process adds the positional information using the freqs_cos and freqs_sin tensors.

CausalSelfAttention Layer

When processing the causal self-attention, the following steps occur.

  1. First the input h tensor is sent to three separate and distinct Linear layers (w_q, w_k, and w_v) to produce the xq, xk, and xv tensors.
  2. The three xq, xk, and xv tensors are reshaped to allow for processing each of the 6 heads. After reshaping, each of these tensors has a shape of (128,256,6,48) where each of the 6 heads has a dimension of 48.
  3. The RoPE algorithm [11] is run on the xq and xk tensors which adds positional information that is essential for the attention processing. The xv tensor is left unrotated.
  4. The xq, xk, and xv tensors are transposed so that attention can be run along each head. After the transposition, these tensors have a shape (128,6,256,48).
  5. Next, the ScaledDotProductAttention is run with the xq, xk, and xv tensors as input and outputs the out tensor of shape (128,6,256,48).
  6. The out tensor is then transposed…
  7. … and reshaped back to the shape (128,256,288)
  8. The reshaped out tensor is then run through a Linear layer…
  9. … and optionally a Dropout layer for model regularization.
  10. The final output tensor of the CausalSelfAttention, out has a shape of (128,256,288).
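
The head reshaping and the rotary step can be sketched in NumPy as follows. This is an illustration of the RoPE formulation in [11] applied to the query and key tensors, with adjacent feature pairs rotated together; the function names, the theta value of 10000, and the small batch and sequence sizes are assumptions for the example.

```python
import numpy as np


def precompute_freqs(head_dim, seq_len, theta=10000.0):
    """One rotation angle per (position, pair of features)."""
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, head_dim/2)
    return np.cos(angles), np.sin(angles)


def apply_rope(x, freqs_cos, freqs_sin):
    """Rotate each adjacent feature pair of x by its position's angle.
    x has shape (batch, seq_len, n_heads, head_dim)."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos = freqs_cos[None, :, None, :]                 # broadcast over batch, heads
    sin = freqs_sin[None, :, None, :]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


batch, seq_len, n_heads, head_dim = 2, 16, 6, 48     # 6 heads x 48 = 288
rng = np.random.default_rng(0)
xq = rng.normal(size=(batch, seq_len, n_heads * head_dim))
xk = rng.normal(size=(batch, seq_len, n_heads * head_dim))

# Split the 288 features into 6 heads of 48.
xq = xq.reshape(batch, seq_len, n_heads, head_dim)
xk = xk.reshape(batch, seq_len, n_heads, head_dim)

freqs_cos, freqs_sin = precompute_freqs(head_dim, seq_len)
xq_rot = apply_rope(xq, freqs_cos, freqs_sin)
xk_rot = apply_rope(xk, freqs_cos, freqs_sin)

# Move the head axis forward so attention runs per head.
xq_t = xq_rot.transpose(0, 2, 1, 3)   # (batch, n_heads, seq_len, head_dim)
print(xq_t.shape)
```

A rotation changes the direction of each feature pair but not its length, so the positional information is added without rescaling the queries and keys.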

ScaledDotProductAttention Function

The key layer within the CausalSelfAttention layer that performs the attention is called ScaledDotProductAttention. The main goal of attention is to show the model where to focus which helps these powerful models learn enormous amounts of knowledge.

ScaledDotProductAttention Function

During ScaledDotProductAttention processing, the following steps occur.

  1. First the xk input tensor of shape (128,6,256,48) is transposed along its last two axes into the new shape (128,6,48,256).
  2. Next, a MatMul operation is performed between the xq and xk tensors to produce the initial scores tensor of shape (128,6,256,256).
  3. The scores tensor is then scaled by 1/Sqrt(head_dim) for stability.
  4. Optionally, a mask tensor is added to the scaled_scores to mask out items we do not want attention applied to (such as future values).
  5. Softmax is run on the masked_scores, converting all values into probabilities that add up to 1.0. This produces the smx_scores tensor.
  6. Optionally, a Dropout layer is run on the smx_scores for model generalization.
  7. A MatMul is performed between the smx_scores and the xv values tensor to produce the out tensor of shape (128,6,256,48).
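
The steps above map almost line for line onto a short NumPy function. This is a sketch with small stand-in sizes and no Dropout; the causal mask shown here is one assumed way to hide future positions.

```python
import numpy as np


def scaled_dot_product_attention(xq, xk, xv, causal=True):
    """Inputs have shape (batch, n_heads, seq_len, head_dim)."""
    head_dim = xq.shape[-1]
    seq_len = xq.shape[-2]
    scores = xq @ xk.swapaxes(-2, -1)          # (batch, n_heads, seq_len, seq_len)
    scores = scores / np.sqrt(head_dim)        # scale for stability
    if causal:
        # -inf above the diagonal hides future positions from each token.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = scores + np.where(future, -np.inf, 0.0)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ xv, weights


rng = np.random.default_rng(0)
shape = (2, 6, 5, 48)                          # batch, heads, seq_len, head_dim
xq, xk, xv = (rng.normal(size=shape) for _ in range(3))
out, weights = scaled_dot_product_attention(xq, xk, xv)
print(out.shape, weights.shape)
```

With the causal mask in place, the first token can only attend to itself, so its output is exactly its own value vector.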

FeedForward Final Processing

The FeedForward layer is used to perform the final processing of the TransformerBlock using three internal Linear layers, W1, W2 and W3, which process the input along with the SiLU activation function.

FeedForward Processing

The following steps occur in the FeedForward processing.

  1. First the input tensor h is processed by the W3 Linear layer to produce the h2 tensor with the new shape (128,256,768). The hidden dimension of 768 is derived from the Linear layer input dimension of 288 using a calculation that ensures the output is a multiple of the ‘multiple_of’ variable, which is set to 32.

    For example, with an input dimension of 288, the calculation is 4 x 288 = 1152, then 2/3 of 1152 = 768, which is already a multiple of 32, so the hidden dimension is 768.
  2. Next, the h tensor is processed by the W1 Linear layer to produce the h1 tensor of shape (128,256,768).
  3. The SiLU activation is run on the h1 tensor to produce the h3 tensor.
  4. Element-wise multiplication is performed between the h2 and h3 tensors to produce the h4 tensor of shape (128,256,768).
  5. The h4 tensor is run through a final Linear layer…
  6. … and optional Dropout layer to produce…
  7. … the final output, the h tensor of shape (128,256,288).
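
The FeedForward steps, including the hidden dimension calculation, can be sketched as below. This assumes the rounding rule is 4 times the input dimension, scaled by 2/3 and rounded up to the next multiple of multiple_of. The Linear layers are written as plain matrix multiplications with random weights, biases and Dropout are omitted, and the function names are illustrative.

```python
import numpy as np


def ffn_hidden_dim(dim, multiple_of=32):
    """4 * dim, scaled by 2/3, then rounded up to a multiple of multiple_of."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def feed_forward(h, w1, w2, w3):
    h2 = h @ w3                 # (..., hidden)
    h1 = h @ w1                 # (..., hidden)
    h3 = silu(h1)
    h4 = h2 * h3                # element-wise multiplication
    return h4 @ w2              # back to (..., dim)


dim = 288
hidden = ffn_hidden_dim(dim)    # 768 for dim = 288
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(dim, hidden))
w3 = rng.normal(scale=0.02, size=(dim, hidden))
w2 = rng.normal(scale=0.02, size=(hidden, dim))
h = rng.normal(size=(2, 4, dim))   # tiny batch and sequence length
out = feed_forward(h, w1, w2, w3)
print(hidden, out.shape)
```

The rounding only matters when the scaled value is not already a multiple of 32; for example, an input dimension of 100 gives 266, which is rounded up to 288.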


In this post, we have shown how the Baby Llama model pre-processes its training data by pre-tokenizing the data. Next, we described the overall training process used to train the model from scratch and then described the model itself in detail. These designs were derived from Karpathy’s Python code found in the train.py and model.py files at [1].

To see how inferencing takes place using the Llama2.c, see our previous post.

Happy Deep Learning with LLMs!

[1] GitHub:karpathy/llama2.c, by Andrej Karpathy, 2023, GitHub

[2] GitHub:karpathy/minGPT, by Andrej Karpathy, 2022, GitHub

[3] GPT now supported with Transformer Models using CUDA 11.8 and cuDNN 8.6! by Dave Brown, 2022, SignalPop LLC

[4] Attention Is All You Need, by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017, arXiv:1706.03762.

[5] TinyStories: How Small Can Language Models Be and Still Speak Coherent English? by Ronen Eldan and Yuanzhi Li, 2023, arXiv:2305.07759.

[6] Byte-Pair Encoding, Wikipedia

[7] Byte-Pair Encoding: Subword-based tokenization algorithm, by Chetna Khanna, 2021, Medium

[8] Decoupled Weight Decay Regularization, by Ilya Loshchilov and Frank Hutter, 2019, arXiv:1711.05101

[9] Root Mean Square Layer Normalization, by Biao Zhang and Rico Sennrich, 2019, arXiv:1910.07467

[10] Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names, by Raúl Gómez, 2018, Raúl Gómez blog.

[11] RoFormer: Enhanced Transformer with Rotary Position Embedding, by Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, 2021, arXiv:2104.09864