Controlling growth of activations in the residual stream（01:17:37 - 01:19:50）
Let's reproduce GPT-2 (124M)

We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Links:
- build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nanogpt
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- llm.c repo: https://github.com/karpathy/llm.c
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp

Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165 - OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com

Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperpamaters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo

Corrections:
I will post all errata and followups to the build-nanogpt GitHub repo (link above)

SuperThanks:
I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to to supporting my work in AI + Education.

#neural network #GPT #karpathy #LLM #language model #large language model #ChatGPT #NVIDIA #GPU #PyTorch #Python #deep learning #education

Controlling growth of activations in the residual stream（01:17:37 - 01:19:50）Let's reproduce GPT-2 (124M)

intro: Let’s reproduce GPT-2 (124M)

* *** Exploring the Target:* The video starts by loading the pre-trained GPT-2 (124M) model from Hugging Face Transformers and examining its weights and architecture.

- 🤖 Reproducing GPT-2 124M model- Reproducing the GPT-2 model involves understanding its release structure and model variations.

M) model, including its state dictionary and tensor shapes. We learn how the model's vocabulary size and embedding dimensions are represented within these tensors. (-

Reproducing the GPT-2 124M version

- 💻 Model Parameters Overview- GPT-2 miniseries comprises models of various sizes, with the 124 million parameter model being a significant variant.- Model parameters dictate its size, layer count, and channel dimensions, affecting downstream task performance.

- 💰 Reproducibility and Cost- Reproducing the GPT-2 124M model is now more accessible and affordable due to advances in hardware and cloud computing.- Achieving comparable model performance can be done in a relatively short time and at a reasonable cost.

Validation loss measures model's performance on unseen data.

exploring the GPT-2 (124M) OpenAI checkpoint

@ now... so far so good...

compared to the original Transformer are explored, such as the removal of the encoder and cross-attention mechanism. Further, modifications to layer normalization placement and the addition of a final layer normalization layer are highlighted. (-

- 🧠 Understanding Model Structure- Exploring the structure of the GPT-2 model involves inspecting token and positional embeddings, as well as layer weights.- The visualization of embeddings and weights reveals insights into the model's learning process and representation.

GPT-2 token and position embeddings explained

, aligning it with the schema used by Hugging Face Transformers. This skeleton includes modules for token and positional embeddings, Transformer blocks, final layer normalization, and the language model head. (-

Understanding token positions and embeddings in GPT-2 (124M)

has the freedom to learn the position embeddings (the original transformer paper hardcoded the positional embeddings)

is discussed. (-

Implementing and understanding GPT-2 (124M) model architecture.

SECTION 1: implementing the GPT-2 nn.Module

* *** Implementing the GPT-2 nn.Module:* A custom GPT-2 class is built in PyTorch, mirroring the Hugging Face architecture and loading the pre-trained weights for verification.

Creating a matching schema for loading weights easily.

's implementation through tensor manipulation and its algorithmic similarity to previous implementations. (-

You want a direct residual connection from the target to the input embeddings, skipping layer normalization (I need to understand what layer normalization is)

The Transformer involves repeated application of map and reduce

Implementing the Forward Pass and Text Generation: The forward pass of the network is implemented, outlining how input token indices are processed to produce logits for predicting the next token in a sequence. This sets the stage for generating text from the model. (-

Its funny how his description of attention as reduce-map description at can be thought of as map-reduce :)

the comparison between attention and mlp is impressive

GPT-2 used the 10h approximate version of G instead of the exact version.

Activation function GELU is an approximation

model. This involves tokenizing a prefix string, moving the model to a CUDA device for GPU acceleration, and performing sampling-based text generation. (-

GPT-2 (124M) implementation details

Efficient implementation in PyTorch for GPT-2 (124M) model

Introducing the Tiny Shakespeare Dataset: This part introduces the Tiny Shakespeare dataset as a small and manageable dataset for initial model training and debugging. Basic statistics of the dataset are explored. (-

loading the huggingface/GPT-2 parameters

Forwarding the GPT-2 model requires processing token indices and embeddings.

implementing the forward pass to get logits

* *** Forward Pass and Sampling:* The forward pass is implemented to calculate logits, and a sampling loop is added to generate text from the model.

model. It introduces the concept of batching and creating input-target pairs for loss calculation. (-

Explaining the forward pass of the GPT-2 network

sampling init, prefix tokens, tokenization

Creating a Simple Data Loader: This section refactors the code to create a simple data loader object responsible for loading tokenized data from the Tiny Shakespeare dataset and generating batches suitable for training the model. (-

Generating logits and probabilities for token prediction

sampling loop

(time )

why do we only keep the last column of the logits?

Using top K by default (50) helps keep the model on track

Calculating Loss and Backpropagation: The forward function is adjusted to return not just the logits but also the calculated loss based on provided target tokens. Cross-entropy loss is used, and the initial loss is sanity-checked to ensure reasonable starting probabilities. (-

- 🤖 Replicating GPT-2 Model Initialization- Replicating the GPT-2 model initialization process.- Transitioning from pre-trained weights to initializing from random numbers.- Exploring the straightforward process of using a random model in PyTorch.

sample, auto-detect the device

Using GPT-2 (124M) for model initialization

- 🔍 Detecting and Utilizing Device in PyTorch- Automatically detecting and utilizing available devices in PyTorch.- Strategies for choosing the highest compute-capable device.- Facilitating code compatibility across different hardware configurations.

Implementing Optimization with AdamW: This section introduces the AdamW optimizer as an alternative to stochastic gradient descent (SGD), highlighting its advantages for language model training. The optimization loop is implemented, including gradient accumulation and loss printing. (-

Initializing model on correct device is crucial for performance

let’s train: data batches (B,T) → logits (B,T,C)

- 📄 Preparing and Tokenizing Dataset- Introduction to the Tiny Shakespeare dataset for training.- Obtaining and processing the dataset for tokenization.- Initial exploration and preprocessing steps for training data.

Understanding and Addressing Device Mismatches: This part emphasizes the importance of ensuring all tensors and model components reside on the same device (CPU or GPU) to avoid errors during training. A bug related to tensor device mismatch is identified and corrected. (-

Transforming single sequence into batch with structured tokens

Creating input and labels for Transformer

model based on the original paper's guidelines. This includes using specific standard deviations for different layer types and scaling residual connections to control activation growth. (-

-- 🛠 Implementing Data Loader and Loss Calculation- Building a data loader to feed token sequences into the Transformer model.- Setting up the forward pass to calculate the loss function.- Establishing a structured approach for loss calculation and gradient updates.

cross entropy loss

Flattening multi-dimensional tensors for cross entropy calculation.

Calculating the estimated loss at initialization

GPU, focusing on its theoretical performance limits in terms of Teraflops for different floating-point precisions. The importance of memory bandwidth limitations is also discussed. (-

The loss at initialization is expected to be around 10.82 but is seen around 11, which suggests a diffused probability distribution at initialization.

Fun Fact: -ln(1/50257) = 10.82 but simply ln(50257) also gives the same answer.

optimization loop: overfit a single batch

Question regarding overfitting a single batch .

- 🧮 Optimizing Model Parameters with AdamW- Implementing optimization using the AdamW optimizer.- Understanding the role and benefits of AdamW compared to SGD.- Executing gradient updates and monitoring loss during the optimization process.

Pytorch library had bugs that the canonical version (e.g. Adam) is the buggy version (fixed in AdamW)

, are introduced as ways to trade precision for significant speed improvements. (-

Explaining the device issue and fixing tensor moving bug.

Attempting to overfit on a single example

Creating a simple data loader for iterating through batches of data.

data loader lite

- 📊 Implementation of a Simple Data Loader- The data loader reads text files and tokenizes them for model input.- It divides the data into batches, ensuring smooth iteration over the dataset.- Basic functionality covers chunking data and managing batch transitions.

precision in PyTorch to leverage tensor cores and achieve a substantial speedup in training without noticeable accuracy degradation. (-

Bug in GPT-2 training process

parameter sharing wte and lm_head

Controlling growth of activations in the residual stream（01:17:37 - 01:19:50）
Let's reproduce GPT-2 (124M)