version 2: using matrix multiply (00:51:54 - 00:54:42)
Let's build GPT: from scratch, in code, spelled out.

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.

Links:
- Google colab for the video: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
- GitHub repo for the video: https://github.com/karpathy/ng-video-lecture
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- my website: https://karpathy.ai
- my twitter:
- our Discord channel: https://discord.gg/3zy8kqD9Cp

Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.

Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT; a hedged sketch of the idea follows this list).
- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?
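
For EX1, here is a minimal sketch of what "heads as another batch dimension" can look like. It is illustrative, not the reference answer (that lives in nanoGPT's CausalSelfAttention), and the class and argument names are my own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BatchedMultiHeadAttention(nn.Module):
    """All heads computed in one pass: (B, T, C) -> (B, n_head, T, head_size) internally."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # q, k, v in one projection
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)                 # each (B, T, C)
        # move the head dimension next to the batch dimension
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)     # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * hs**-0.5              # (B, nh, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v                                         # (B, nh, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # concatenate the heads back
        return self.proj(out)

x = torch.randn(4, 8, 32)
print(BatchedMultiHeadAttention(32, 4, 8)(x).shape)  # torch.Size([4, 8, 32])
```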

Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions

Corrections:
00:57:00 Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)
01:20:05 Oops I should be using the head_size for the normalization, not C

Timetable (timestamped notes and viewer comments):
00:00:00 - 00:02:18 🤖 ChatGPT is a system that allows interaction with an AI for text-based tasks. (@Gaurav-pq2ug)
00:01:01 "Write a bible story about Jesus turning dirt into cocaine for a party" WOW, what a prompt. (@Pikachu-iw1se)
00:02:18 - 00:05:46 🧠 The Transformer neural network from the "Attention is All You Need" paper is the basis for ChatGPT. (@Gaurav-pq2ug)
00:05:46 - 00:07:23 📊 nanoGPT is a repository for training Transformers on text data. (@Gaurav-pq2ug)
00:07:23 - 00:10:11 🏗 Building a Transformer-based language model with nanoGPT starts with character-level training on a dataset. (@Gaurav-pq2ug)
00:10:11 - 00:13:36 💡 Tokenizing involves converting raw text to sequences of integers, with different methods like character-level or subword tokenizers. (@Gaurav-pq2ug)
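
The character-level tokenizer the note above refers to is just two lookup tables. A minimal sketch, assuming the tiny shakespeare text has been saved as input.txt:

```python
# build a character-level vocabulary from the training text
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))                      # e.g. 65 unique characters for tiny shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> string

encode = lambda s: [stoi[c] for c in s]             # text to a list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of integers back to text

assert decode(encode("hii there")) == "hii there"
```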
00:11:00 Thank you Andrej! You're so passionate about your job. It was 11 am when you started coding; now it's dark in here and you're still trying to teach! 🙏
00:13:36 - 00:18:43 📏 Training a Transformer involves working with chunks of data, not the entire dataset, to predict sequences. (@Gaurav-pq2ug)
00:18:22 Here you mention that mini-batches are only done for efficiency reasons, but wouldn't they also help keep the gradients more stable by reducing variance? (@Meru-v7f)
00:18:43 - 00:22:59 ⏩ Transformers process multiple text chunks independently as batches for efficiency in training. (@Gaurav-pq2ug)
00:19:38 Shouldn't it be len(data) - block_size - 1? Theoretically there is a one-in-a-million (or whatever the total character count is) chance of getting i = len(data) - 8 for x, then len(data) - 7 for y, and then the last index in data[i+1:i+block_size+1] would be outside the list. (@ziga1122)
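
On the question above: torch.randint's upper bound is exclusive, so the largest sampled offset is len(data) - block_size - 1 and data[i+1 : i+block_size+1] stays in range. A sketch of a batch loader in the spirit of the video (the stand-in data tensor is only for the demo):

```python
import torch

block_size = 8   # maximum context length
batch_size = 4   # how many independent sequences we process in parallel

def get_batch(data):
    # randint's high is exclusive, so i + block_size + 1 <= len(data) always holds
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      # inputs  (B, T)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets (B, T), shifted by one
    return x, y

data = torch.randint(0, 65, (1000,))  # stand-in for the encoded text
xb, yb = get_batch(data)
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```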
00:22:59 - 00:24:09 🧠 Explaining the creation of a token embedding table. (@Gaurav-pq2ug)
00:24:09 - 00:25:19 🎯 Predicting the next character based on individual token identity. (@Gaurav-pq2ug)
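
A minimal sketch of the bigram model these notes describe: the token embedding table is (vocab_size, vocab_size), so the embedding row for a token is read directly as the logits for the next token (the vocab size of 65 is tiny shakespeare's character count):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss

m = BigramLanguageModel(65)
idx = torch.zeros((4, 8), dtype=torch.long)
logits, loss = m(idx, idx)
print(logits.shape, loss.item())  # (4, 8, 65); a fully uniform model would sit at ln(65) ≈ 4.17
```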
00:24:13 Hi Andrej, thank you so much for investing your time in sharing this priceless video. A question: when the input to the embedding block is a B x T tensor, shouldn't the output of the embedding block be called the embeddings for the given tensor? (@harshmittal63)
00:24:50 My note: continue watching from here. (@lokeshwaranm244)
00:24:52 It sounds like the transformers are great, but the neural network is where you make or break your AI. If that's not encoded properly to already know rules about what it means to be "5", then you're SOL. (@KillianTwew)
00:25:19 - 00:26:44 💡 Using negative log likelihood loss (cross entropy) to measure prediction quality. (@Gaurav-pq2ug)
00:26:44 - 00:28:22 🔄 Reshaping logits for appropriate input to the cross entropy function. (@Gaurav-pq2ug)
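
F.cross_entropy expects the class dimension second, so the (B, T, C) logits are flattened to (B*T, C). A small sketch with illustrative shapes; it also shows why logits.view(B, C, T) gives a different loss (view only reinterprets the same memory, it does not move the class dimension the way transpose does), and where the uniform baseline -ln(1/65) comes from:

```python
import torch
from torch.nn import functional as F

B, T, C = 4, 8, 65                       # batch, time, vocab size
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# correct: flatten batch and time, keep classes last -> (B*T, C) against (B*T,)
loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))

# equivalent: move the class dim to position 1 with transpose (not view)
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)  # (B, C, T) against (B, T)
assert torch.allclose(loss, loss_alt)

# expected loss if the model predicted the 65 characters uniformly: -ln(1/65) = ln(65)
print(loss.item(), torch.log(torch.tensor(65.0)).item())  # second number ≈ 4.17
```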
00:27:45 Why can't we write logits = logits.view(B,C,T) and keep targets the same? When I do this the loss value differs and I can't understand why. (@gauravfotedar)
00:28:22 - 00:31:21 💻 Training the model using the Adam optimizer with a larger batch size. (@Gaurav-pq2ug)
00:28:26 Why is the expected nll -ln(1/65)? How did the ratio 1/65 come about? (@anusuyanallathambi248)
00:28:27 Never would have I ever expected to get Rick-rolled by Andrej. (@hex6dec1mal)
00:28:31 Are you kidding me, I get Rick Rolled in a video about LLMs? (@joevero4568)
00:30:01 Is there a difference between categorical sampling and softmax + multinomial if we're sampling a single item? (@DiogoNeves)
00:31:21 - 00:34:38 🏗 Generating tokens from the model by sampling via softmax probabilities. (@Gaurav-pq2ug)
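
A sketch of the sampling loop that note refers to: take the logits at the last position, softmax them into probabilities, and draw one token with torch.multinomial. Sampling (rather than taking the argmax) is also why the bigram model can follow the same character with different characters on different draws:

```python
import torch
from torch.nn import functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    # idx is (B, T): the running context of token indices
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                    # (B, T, vocab_size)
        logits = logits[:, -1, :]                 # focus on the last time step
        probs = F.softmax(logits, dim=-1)         # (B, vocab_size)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        idx = torch.cat((idx, idx_next), dim=1)   # append and continue
    return idx

# usage with the bigram sketch above:
# out = generate(m, torch.zeros((1, 1), dtype=torch.long), 100)
```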
00:34:38 - 00:41:23 🛠 The training loop includes evaluation of loss and parameter updates. (@Gaurav-pq2ug)
00:37:00 How come a specific letter can be followed by various others? If the model is about bigrams and it has certain constant weights, then one would think that a letter will always lead to the same following letter, yet they vary, producing some long ~random output. (@FreakyStyleytobby)
00:37:18 "OK, so we see that we're starting to get something at least like reasonable-ish" :,D I love this tutorial! Thank you for your time and passion! (@karakson)
00:37:32 A very nice piece of Vogon poetry here. (@DaveJ6515)
00:37:48 Question: ❓ You say that the bigram model predicts from only the previous character, but I see that the first word is POPSousthe... If after the first P comes an O, but after the following P comes an S, where is this variation coming from? Does someone have an answer? (@FilippoBasso73)
00:38:50 Shouldn't line 115 read logits, loss = m(xb, yb) rather than logits, loss = model(xb, yb)? Similarly with line 54? (@pennyfarthingchapel)
00:41:23 - 00:45:59 📉 Using `torch.no_grad()` for efficient memory usage during evaluation. (@Gaurav-pq2ug)
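
A sketch of the kind of evaluation helper that note refers to: average the loss over several random batches under torch.no_grad so no computation graph (and its memory) is kept. The get_batch here is assumed to take a 'train'/'val' split name:

```python
import torch

eval_iters = 200

@torch.no_grad()
def estimate_loss(model, get_batch):
    out = {}
    model.eval()                      # e.g. switches off dropout
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```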
00:45:20 When he says we take the average, is he implying that we take the average of the token-mapped numbers? If yes, how would that remotely help? (@FreakyBaguette)
00:45:59 - 00:47:22 🧮 Tokens are averaged out to create a one-dimensional vector for efficient processing. (@Gaurav-pq2ug)
00:47:22 - 00:50:27 🔢 Matrix multiplication can efficiently perform aggregations instead of averages. (@Gaurav-pq2ug)
00:47:30 Just that little time you take to explain a trick here shows how great a teacher you are; thanks a lot for this video! (@to-grt)
00:50:27 - 00:54:51 🔀 Manipulating elements in the multiplying matrix allows for incremental averaging based on ones and zeros. (@Gaurav-pq2ug)
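
The "ones and zeros" trick described above, as a minimal sketch: a lower-triangular matrix whose rows are normalized to sum to 1 turns a matrix multiply into a running average over the current and all previous positions:

```python
import torch

torch.manual_seed(42)
T = 4
x = torch.randn(T, 2)                 # T tokens, 2 channels each

wei = torch.tril(torch.ones(T, T))    # ones on and below the diagonal, zeros above
wei = wei / wei.sum(1, keepdim=True)  # each row sums to 1, so row t averages x[0..t]
xbow = wei @ x                        # (T, T) @ (T, 2) -> (T, 2)

# row 2 of xbow is the mean of the first three tokens
assert torch.allclose(xbow[2], x[:3].mean(0))
print(wei)
```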
00:54:51 - 00:58:27 🔄 Introducing softmax helps in setting interaction strengths and affinities between tokens. (@Gaurav-pq2ug)
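
Version 3 in the video reaches the same averaging matrix through softmax: start from zero affinities (later these become data-dependent), mask out future positions with -inf, and softmax each row. A sketch:

```python
import torch
from torch.nn import functional as F

T = 4
tril = torch.tril(torch.ones(T, T))

wei = torch.zeros(T, T)                          # affinities; all equal for now
wei = wei.masked_fill(tril == 0, float('-inf'))  # tokens from the future cannot communicate
wei = F.softmax(wei, dim=-1)                     # each row becomes uniform over the past

# identical to normalizing the lower-triangular ones matrix by its row sums
assert torch.allclose(wei, tril / tril.sum(1, keepdim=True))
```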
00:56:59 I think there is a mistake here: Andrej says "tokens from the past cannot communicate", but the correct version is "tokens from the future cannot communicate". (@gegao9066)
00:58:27 - 01:02:07 🧠 Weighted aggregation of past elements using matrix multiplication aids in building the self-attention block. (@Gaurav-pq2ug)
01:01:00 Watch the video once. Watch it again. Watch it a few more times. Then watch 01:01:00 - 01:11:00 about 20 times, melting your brain trying to keep track of tensor dimensions. This is a *dense* video; amazing how much detail is packed into 2 hours... thanks for this Andrej! (@solaxun)
01:01:58 ...AM until I saw the message here :) (@muthuraja4172)
01:02:07 - 01:05:13 🔂 Self-attention involves emitting query and key vectors to determine token affinities and weighted aggregations. (@Gaurav-pq2ug)
01:03:53 The top and most important part. What a great guy! (@jose-kj3fg)
01:05:13 - 01:10:10 🎭 Implementing a single head of self-attention involves computing queries and keys and performing dot products for weighted aggregations. (@Gaurav-pq2ug)
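
A sketch of the single head those notes describe, in the spirit of the video's Head module (an n_embd of 32 and head_size of 16 are illustrative). Note that key, query and value are three separately initialized nn.Linear layers, so they produce different projections of the same input x, which is the point raised in a couple of the comments below:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size): what each token contains
        q = self.query(x)  # (B, T, head_size): what each token is looking for
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T) scaled affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # decoder: no peeking ahead
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # what each token communicates if attended to
        return wei @ v     # (B, T, head_size)

x = torch.randn(4, 8, 32)
print(Head(32, 16, 8)(x).shape)  # torch.Size([4, 8, 16])
```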
01:07:05 You introduced nn.Linear() here, but that confused me, so I looked into the PyTorch nn.Linear documentation. The ambiguous point is that key = nn.Linear(C, head_size, bias=False) and value = nn.Linear(C, head_size, bias=False) look like identical calls, so I expected the dot product of key(x) and value(x) to be the same as the dot product of key(x) and key(x). Thanks to your Colab code, I found that when I changed the seed, key(x) and value(x) changed, which means Linear()'s weight matrix is randomly initialized; the documentation says the values are initialized from U(-sqrt(k), sqrt(k)), so U is a uniform random distribution. I am a beginner in AI and this is my first real course, but now the rest is clear. Other beginners (like me) may struggle to understand that part. (@hitoshiyamauchi)
01:07:50 The main explanation of keys x queries is here. My concentration is so poor, I kept falling asleep every 5 minutes, but I kept on trying; eventually, after 7 hours of watching, dropping off, and watching again, the penny dropped. This bloke is a nice person for doing this for us. (@miroslavdyer-wd1ei)
01:07:59 Why is that "up to four"? What does the "four" mean? (@hujosh8693)
01:10:10 - 01:11:46 🧠 The self-attention mechanism aggregates information using key, query, and value vectors. (@Gaurav-pq2ug)
01:10:55 The key, query and value matrices are created with nn.Linear. With the same input for all three, shouldn't they give the same output, i.e. key, query and value would be the same for a given input? What is the difference in terms of calculation? (@nanunane1)
01:11:30 "That is basically the self-attention mechanism. It is what it does." Andrej's expression says that this simple piece of code does all the magic. :) (@prateekcaire9456)
01:11:46 - 01:12:56 🛠 Attention is a communication mechanism between nodes in a directed graph. (@Gaurav-pq2ug)
01:12:56 - 01:13:53 🔍 Attention operates over a set of vectors without positional information, requiring external encoding. (@Gaurav-pq2ug)
01:13:53 - 01:15:46 💬 Attention mechanisms facilitate data-dependent weighted sum aggregation. (@Gaurav-pq2ug)
01:15:46 - 01:17:50 🤝 Self-attention involves keys, queries, and values from the same source, while cross-attention brings in external sources. (@Gaurav-pq2ug)
01:17:50 - 01:21:27 🧮 Scaling the attention values is crucial for network optimization by controlling variance. (@Gaurav-pq2ug)
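
The scaling that note refers to, as a small numerical check: without the 1/sqrt(head_size) factor the variance of the attention logits grows with head_size, and softmax over high-variance logits saturates toward near one-hot vectors:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, head_size = 32, 8, 64
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

wei_raw = q @ k.transpose(-2, -1)       # variance roughly equal to head_size
wei_scaled = wei_raw * head_size**-0.5  # variance brought back to roughly 1

print(wei_raw.var().item(), wei_scaled.var().item())  # ≈ 64 vs ≈ 1
# sharper logits make softmax concentrate its mass on a few positions
print(F.softmax(wei_raw[0, 0], dim=-1).max().item(),
      F.softmax(wei_scaled[0, 0], dim=-1).max().item())
```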
01:20:05 Shouldn't wei be normalized by the square root of head_size instead of the square root of C? (@HarshitSingh-tg9yv)
01:20:10 Thank you Andrej! Shouldn't the code say (B, T, head_size) on lines 73, 74, and 81? Or is head_size = C? (@SmashLibrary)
01:21:27 - 01:26:36 💡 Implementing multi-head attention involves running self-attention in parallel and concatenating the results for improved communication channels. (@Gaurav-pq2ug)
01:23:46 For anyone getting an error after adding the multi-head attention block here: I think current PyTorch wants an explicit integer for the head_size of MultiHeadAttention(). This fixed my error: self.self_attention_heads = MultiHeadAttention(4, int(n_embd/4)). (@ItsRyanStudios)
01:26:36 - 01:28:29 ⚙ Integrating communication and computation in Transformer blocks enhances network performance. (@Gaurav-pq2ug)
01:28:29 - 01:32:16 🔄 Residual connections aid in optimizing deep networks by facilitating gradient flow and easier training. (@Gaurav-pq2ug)
01:32:16 - 01:32:58 🧠 Adjusting channel sizes in the feed-forward network can affect validation loss and lead to potential overfitting. (@Gaurav-pq2ug)
01:32:58 - 01:35:19 🔧 LayerNorm in deep neural networks helps optimize performance; it is similar to batch normalization but normalizes rows instead of columns. (@Gaurav-pq2ug)
01:35:19 - 01:37:12 📐 Implementing LayerNorm in a Transformer involves reshuffling the layer norms into the pre-norm formulation for better results. (@Gaurav-pq2ug)
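
Pulling the last few notes together (multi-head attention, feed-forward, residual connections, pre-norm LayerNorm), a sketch of a Transformer block in the video's style. MultiHeadAttention here is assumed to be the batched module sketched after the exercise list above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token computation: expand 4x, nonlinearity, project back down."""
    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (MLP), with residuals."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = BatchedMultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)  # pre-norm: normalize before each sub-layer
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # residual: gradients flow straight through the '+'
        x = x + self.ffwd(self.ln2(x))
        return x
```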
01:37:12 - 01:39:30 📈 Scaling up the model by adjusting hyperparameters like batch size, block size, and learning rate can greatly improve validation loss. (@Gaurav-pq2ug)
01:39:30 - 01:51:21 🔒 Using dropout as a regularization technique helps prevent overfitting when scaling up models significantly. (@Gaurav-pq2ug)
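
For reference, the scaled-up configuration is roughly the following (values as I recall them from the video and the gpt.py script in the repo linked above; treat them as approximate and double-check against the source):

```python
# hyperparameters of the scaled-up model (approximate; see gpt.py in the video's repo)
batch_size = 64       # sequences per forward/backward pass
block_size = 256      # maximum context length
n_embd = 384          # embedding dimension
n_head = 6            # 384 / 6 = 64 dimensions per head
n_layer = 6           # number of Transformer blocks
dropout = 0.2         # regularization added when scaling up
learning_rate = 3e-4  # lowered because the network is much bigger
max_iters = 5000
```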
01:40:01 Thanks a lot for this revelation! One question: how is the final number of parameters (10M) exactly calculated? Isn't the FFN receiving 64 inputs from attention and having 6 layers, which would make 64^6 parameters already, which is way more? I think I misunderstood the model's architecture at some point; could someone help? (@johannessteffen5071)
01:41:03 Just for reference: this training took 3 hours, 5 minutes on a 2020 M1 MacBook Air. You can use the "mps" device instead of cuda or cpu. (@Rydn)
01:44:53 Difference and/or relations between encoder and decoder. (@kevin_machine_learning)
01:47:52 Unintentional pun: "now we have a fourth dimension, which is the heads, and so it gets a lot more hairy". (@mannyc6649)
01:51:21 - 01:56:20 🌐 ChatGPT undergoes pre-training on internet data followed by fine-tuning to become a question-answering assistant by aligning model responses with human preferences. (@Gaurav-pq2ug)
