version 2: using matrix multiply (00:51:54 - 00:54:42)
Let's build GPT: from scratch, in code, spelled out.

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.

Links:
- Google colab for the video: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
- GitHub repo for the video: https://github.com/karpathy/ng-video-lecture
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- my website: https://karpathy.ai
- my twitter:
- our Discord channel: https://discord.gg/3zy8kqD9Cp

Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.

Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT; a hedged sketch of the idea follows this list).
- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?
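
For EX1, here is a minimal sketch of what "heads as another batch dimension" can look like. It is illustrative, not the reference answer (that lives in nanoGPT's CausalSelfAttention), and the class and argument names are my own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BatchedMultiHeadAttention(nn.Module):
    """All heads computed in one pass: (B, T, C) -> (B, n_head, T, head_size) internally."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # q, k, v in one projection
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)                 # each (B, T, C)
        # move the head dimension next to the batch dimension
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)     # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * hs**-0.5              # (B, nh, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v                                         # (B, nh, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # concatenate the heads back
        return self.proj(out)

x = torch.randn(4, 8, 32)
print(BatchedMultiHeadAttention(32, 4, 8)(x).shape)  # torch.Size([4, 8, 32])
```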

Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions

Corrections:
00:57:00 Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)
01:20:05 Oops I should be using the head_size for the normalization, not C

Timetable (timestamped notes and viewer comments):
00:00:00 - 00:02:18 🤖 ChatGPT is a system that allows interaction with an AI for text-based tasks. (@Gaurav-pq2ug)
00:01:01 "Write a bible story about Jesus turning dirt into cocaine for a party" WOW, what a prompt. (@Pikachu-iw1se)
00:02:18 - 00:05:46 🧠 The Transformer neural network from the "Attention is All You Need" paper is the basis for ChatGPT. (@Gaurav-pq2ug)
00:05:46 - 00:07:23 📊 nanoGPT is a repository for training Transformers on text data. (@Gaurav-pq2ug)
00:07:23 - 00:10:11 🏗 Building a Transformer-based language model with nanoGPT starts with character-level training on a dataset. (@Gaurav-pq2ug)
00:10:11 - 00:13:36 💡 Tokenizing involves converting raw text to sequences of integers, with different methods like character-level or subword tokenizers. (@Gaurav-pq2ug)
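
The character-level tokenizer the note above refers to is just two lookup tables. A minimal sketch, assuming the tiny shakespeare text has been saved as input.txt:

```python
# build a character-level vocabulary from the training text
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))                      # e.g. 65 unique characters for tiny shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> string

encode = lambda s: [stoi[c] for c in s]             # text to a list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of integers back to text

assert decode(encode("hii there")) == "hii there"
```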
00:11:00 Thank you Andrej! You're so passionate about your job. It was 11 am when you started coding; now it's dark in here and you're still trying to teach! 🙏
00:13:36 - 00:18:43 📏 Training a Transformer involves working with chunks of data, not the entire dataset, to predict sequences. (@Gaurav-pq2ug)
00:18:22 Here you mention that mini-batches are only done for efficiency reasons, but wouldn't they also help keep the gradients more stable by reducing variance? (@Meru-v7f)
00:18:43 - 00:22:59 ⏩ Transformers process multiple text chunks independently as batches for efficiency in training. (@Gaurav-pq2ug)
00:19:38 Shouldn't it be len(data) - block_size - 1? Theoretically there is a one-in-a-million (or whatever the total character count is) chance of getting i = len(data) - 8 for x, then len(data) - 7 for y, and then the last index in data[i+1:i+block_size+1] would be outside the list. (@ziga1122)
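
On the question above: torch.randint's upper bound is exclusive, so the largest sampled offset is len(data) - block_size - 1 and data[i+1 : i+block_size+1] stays in range. A sketch of a batch loader in the spirit of the video (the stand-in data tensor is only for the demo):

```python
import torch

block_size = 8   # maximum context length
batch_size = 4   # how many independent sequences we process in parallel

def get_batch(data):
    # randint's high is exclusive, so i + block_size + 1 <= len(data) always holds
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      # inputs  (B, T)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets (B, T), shifted by one
    return x, y

data = torch.randint(0, 65, (1000,))  # stand-in for the encoded text
xb, yb = get_batch(data)
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```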
00:22:59 - 00:24:09 🧠 Explaining the creation of a token embedding table. (@Gaurav-pq2ug)
00:24:09 - 00:25:19 🎯 Predicting the next character based on individual token identity. (@Gaurav-pq2ug)
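
A minimal sketch of the bigram model these notes describe: the token embedding table is (vocab_size, vocab_size), so the embedding row for a token is read directly as the logits for the next token (the vocab size of 65 is tiny shakespeare's character count):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss

m = BigramLanguageModel(65)
idx = torch.zeros((4, 8), dtype=torch.long)
logits, loss = m(idx, idx)
print(logits.shape, loss.item())  # (4, 8, 65); a fully uniform model would sit at ln(65) ≈ 4.17
```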
00:24:13 Hi Andrej, thank you so much for investing your time in sharing this priceless video. A question: when the input to the embedding block is a B x T tensor, shouldn't the output of the embedding block be called the embeddings for the given tensor? (@harshmittal63)
00:24:50 My note: continue watching from here. (@lokeshwaranm244)
00:24:52 It sounds like the transformers are great, but the neural network is where you make or break your AI. If that's not encoded properly to already know rules about what it means to be "5", then you're SOL. (@KillianTwew)
00:25:19 - 00:26:44 💡 Using negative log likelihood loss (cross entropy) to measure prediction quality. (@Gaurav-pq2ug)
00:26:44 - 00:28:22 🔄 Reshaping logits for appropriate input to the cross entropy function. (@Gaurav-pq2ug)
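
F.cross_entropy expects the class dimension second, so the (B, T, C) logits are flattened to (B*T, C). A small sketch with illustrative shapes; it also shows why logits.view(B, C, T) gives a different loss (view only reinterprets the same memory, it does not move the class dimension the way transpose does), and where the uniform baseline -ln(1/65) comes from:

```python
import torch
from torch.nn import functional as F

B, T, C = 4, 8, 65                       # batch, time, vocab size
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# correct: flatten batch and time, keep classes last -> (B*T, C) against (B*T,)
loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))

# equivalent: move the class dim to position 1 with transpose (not view)
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)  # (B, C, T) against (B, T)
assert torch.allclose(loss, loss_alt)

# expected loss if the model predicted the 65 characters uniformly: -ln(1/65) = ln(65)
print(loss.item(), torch.log(torch.tensor(65.0)).item())  # second number ≈ 4.17
```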
00:27:45 Why can't we write logits = logits.view(B,C,T) and keep targets the same? When I do this the loss value differs and I can't understand why. (@gauravfotedar)
00:28:22 - 00:31:21 💻 Training the model using the Adam optimizer with a larger batch size. (@Gaurav-pq2ug)
00:28:26 Why is the expected nll -ln(1/65)? How did the ratio 1/65 come about? (@anusuyanallathambi248)
00:28:27 Never would have I ever expected to get Rick-rolled by Andrej. (@hex6dec1mal)
00:28:31 Are you kidding me, I get Rick Rolled in a video about LLMs? (@joevero4568)
00:30:01 Is there a difference between categorical sampling and softmax + multinomial if we're sampling a single item? (@DiogoNeves)
00:31:21 - 00:34:38 🏗 Generating tokens from the model by sampling via softmax probabilities. (@Gaurav-pq2ug)
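
A sketch of the sampling loop that note refers to: take the logits at the last position, softmax them into probabilities, and draw one token with torch.multinomial. Sampling (rather than taking the argmax) is also why the bigram model can follow the same character with different characters on different draws:

```python
import torch
from torch.nn import functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    # idx is (B, T): the running context of token indices
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                    # (B, T, vocab_size)
        logits = logits[:, -1, :]                 # focus on the last time step
        probs = F.softmax(logits, dim=-1)         # (B, vocab_size)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        idx = torch.cat((idx, idx_next), dim=1)   # append and continue
    return idx

# usage with the bigram sketch above:
# out = generate(m, torch.zeros((1, 1), dtype=torch.long), 100)
```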
00:34:38 - 00:41:23 🛠 The training loop includes evaluation of loss and parameter updates. (@Gaurav-pq2ug)
00:37:00 How come a specific letter can be followed by various others? If the model is about bigrams and it has certain constant weights, then one would think that a letter will always lead to the same following letter, yet they vary, producing some long ~random output. (@FreakyStyleytobby)
00:37:18 "OK, so we see that we're starting to get something at least like reasonable-ish" :,D I love this tutorial! Thank you for your time and passion! (@karakson)
00:37:32 A very nice piece of Vogon poetry here. (@DaveJ6515)
00:37:48 Question: ❓ You say that the bigram model predicts from only the previous character, but I see that the first word is POPSousthe... If after the first P comes an O, but after the following P comes an S, where is this variation coming from? Does someone have an answer? (@FilippoBasso73)
00:38:50 Shouldn't line 115 read logits, loss = m(xb, yb) rather than logits, loss = model(xb, yb)? Similarly with line 54? (@pennyfarthingchapel)
00:41:23 - 00:45:59 📉 Using `torch.no_grad()` for efficient memory usage during evaluation. (@Gaurav-pq2ug)
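
A sketch of the kind of evaluation helper that note refers to: average the loss over several random batches under torch.no_grad so no computation graph (and its memory) is kept. The get_batch here is assumed to take a 'train'/'val' split name:

```python
import torch

eval_iters = 200

@torch.no_grad()
def estimate_loss(model, get_batch):
    out = {}
    model.eval()                      # e.g. switches off dropout
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```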
00:45:20 When he says we take the average, is he implying that we take the average of the token-mapped numbers? If yes, how would that remotely help? (@FreakyBaguette)
00:45:59 - 00:47:22 🧮 Tokens are averaged out to create a one-dimensional vector for efficient processing. (@Gaurav-pq2ug)
00:47:22 - 00:50:27 🔢 Matrix multiplication can efficiently perform aggregations instead of averages. (@Gaurav-pq2ug)
00:47:30 Just that little time you take to explain a trick here shows how great a teacher you are; thanks a lot for this video! (@to-grt)
00:50:27 - 00:54:51 🔀 Manipulating elements in the multiplying matrix allows for incremental averaging based on ones and zeros. (@Gaurav-pq2ug)
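
The "ones and zeros" trick described above, as a minimal sketch: a lower-triangular matrix whose rows are normalized to sum to 1 turns a matrix multiply into a running average over the current and all previous positions:

```python
import torch

torch.manual_seed(42)
T = 4
x = torch.randn(T, 2)                 # T tokens, 2 channels each

wei = torch.tril(torch.ones(T, T))    # ones on and below the diagonal, zeros above
wei = wei / wei.sum(1, keepdim=True)  # each row sums to 1, so row t averages x[0..t]
xbow = wei @ x                        # (T, T) @ (T, 2) -> (T, 2)

# row 2 of xbow is the mean of the first three tokens
assert torch.allclose(xbow[2], x[:3].mean(0))
print(wei)
```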
00:54:51 - 00:58:27 🔄 Introducing softmax helps in setting interaction strengths and affinities between tokens. (@Gaurav-pq2ug)
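
Version 3 in the video reaches the same averaging matrix through softmax: start from zero affinities (later these become data-dependent), mask out future positions with -inf, and softmax each row. A sketch:

```python
import torch
from torch.nn import functional as F

T = 4
tril = torch.tril(torch.ones(T, T))

wei = torch.zeros(T, T)                          # affinities; all equal for now
wei = wei.masked_fill(tril == 0, float('-inf'))  # tokens from the future cannot communicate
wei = F.softmax(wei, dim=-1)                     # each row becomes uniform over the past

# identical to normalizing the lower-triangular ones matrix by its row sums
assert torch.allclose(wei, tril / tril.sum(1, keepdim=True))
```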
00:56:59 I think there is a mistake here: Andrej says "tokens from the past cannot communicate", but the correct version is "tokens from the future cannot communicate". (@gegao9066)
00:58:27 - 01:02:07 🧠 Weighted aggregation of past elements using matrix multiplication aids in building the self-attention block. (@Gaurav-pq2ug)
01:01:00 Watch the video once. Watch it again. Watch it a few more times. Then watch 01:01:00 - 01:11:00 about 20 times, melting your brain trying to keep track of tensor dimensions. This is a *dense* video; amazing how much detail is packed into 2 hours... thanks for this Andrej! (@solaxun)
01:01:58 ...AM until I saw the message here :) (@muthuraja4172)
01:02:07 - 01:05:13 🔂 Self-attention involves emitting query and key vectors to determine token affinities and weighted aggregations. (@Gaurav-pq2ug)
01:03:53 The top and most important part. What a great guy! (@jose-kj3fg)
01:05:13 - 01:10:10 🎭 Implementing a single head of self-attention involves computing queries and keys and performing dot products for weighted aggregations. (@Gaurav-pq2ug)
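
A sketch of the single head those notes describe, in the spirit of the video's Head module (an n_embd of 32 and head_size of 16 are illustrative). Note that key, query and value are three separately initialized nn.Linear layers, so they produce different projections of the same input x, which is the point raised in a couple of the comments below:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size): what each token contains
        q = self.query(x)  # (B, T, head_size): what each token is looking for
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T) scaled affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # decoder: no peeking ahead
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # what each token communicates if attended to
        return wei @ v     # (B, T, head_size)

x = torch.randn(4, 8, 32)
print(Head(32, 16, 8)(x).shape)  # torch.Size([4, 8, 16])
```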
01:07:05 You introduced nn.Linear() here, but that confused me, so I looked into the PyTorch nn.Linear documentation. The ambiguous point is that key = nn.Linear(C, head_size, bias=False) and value = nn.Linear(C, head_size, bias=False) look like identical calls, so I expected the dot product of key(x) and value(x) to be the same as the dot product of key(x) and key(x). Thanks to your Colab code, I found that when I changed the seed, key(x) and value(x) changed, which means Linear()'s weight matrix is randomly initialized; the documentation says the values are initialized from U(-sqrt(k), sqrt(k)), so U is a uniform random distribution. I am a beginner in AI and this is my first real course, but now the rest is clear. Other beginners (like me) may struggle to understand that part. (@hitoshiyamauchi)
01:07:50 The main explanation of keys x queries is here. My concentration is so poor, I kept falling asleep every 5 minutes, but I kept on trying; eventually, after 7 hours of watching, dropping off, and watching again, the penny dropped. This bloke is a nice person for doing this for us. (@miroslavdyer-wd1ei)
01:07:59 Why is that "up to four"? What does the "four" mean? (@hujosh8693)
01:10:10 - 01:11:46 🧠 The self-attention mechanism aggregates information using key, query, and value vectors. (@Gaurav-pq2ug)
01:10:55 The key, query and value matrices are created with nn.Linear. With the same input for all three, shouldn't they give the same output, i.e. key, query and value would be the same for a given input? What is the difference in terms of calculation? (@nanunane1)
01:11:30 "That is basically the self-attention mechanism. It is what it does." Andrej's expression says that this simple piece of code does all the magic. :) (@prateekcaire9456)
01:11:46 - 01:12:56 🛠 Attention is a communication mechanism between nodes in a directed graph. (@Gaurav-pq2ug)
01:12:56 - 01:13:53 🔍 Attention operates over a set of vectors without positional information, requiring external encoding. (@Gaurav-pq2ug)
01:13:53 - 01:15:46 💬 Attention mechanisms facilitate data-dependent weighted sum aggregation. (@Gaurav-pq2ug)
01:15:46 - 01:17:50 🤝 Self-attention involves keys, queries, and values from the same source, while cross-attention brings in external sources. (@Gaurav-pq2ug)
01:17:50 - 01:21:27 🧮 Scaling the attention values is crucial for network optimization by controlling variance. (@Gaurav-pq2ug)
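
The scaling that note refers to, as a small numerical check: without the 1/sqrt(head_size) factor the variance of the attention logits grows with head_size, and softmax over high-variance logits saturates toward near one-hot vectors:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, head_size = 32, 8, 64
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

wei_raw = q @ k.transpose(-2, -1)       # variance roughly equal to head_size
wei_scaled = wei_raw * head_size**-0.5  # variance brought back to roughly 1

print(wei_raw.var().item(), wei_scaled.var().item())  # ≈ 64 vs ≈ 1
# sharper logits make softmax concentrate its mass on a few positions
print(F.softmax(wei_raw[0, 0], dim=-1).max().item(),
      F.softmax(wei_scaled[0, 0], dim=-1).max().item())
```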
01:20:05 Shouldn't wei be normalized by the square root of head_size instead of the square root of C? (@HarshitSingh-tg9yv)
01:20:10 Thank you Andrej! Shouldn't the code say (B, T, head_size) on lines 73, 74, and 81? Or is head_size = C? (@SmashLibrary)
01:21:27 - 01:26:36 💡 Implementing multi-head attention involves running self-attention in parallel and concatenating the results for improved communication channels. (@Gaurav-pq2ug)
01:23:46 For anyone getting an error after adding the multi-head attention block here: I think current PyTorch wants an explicit integer for the head_size of MultiHeadAttention(). This fixed my error: self.self_attention_heads = MultiHeadAttention(4, int(n_embd/4)). (@ItsRyanStudios)
01:26:36 - 01:28:29 ⚙ Integrating communication and computation in Transformer blocks enhances network performance. (@Gaurav-pq2ug)
01:28:29 - 01:32:16 🔄 Residual connections aid in optimizing deep networks by facilitating gradient flow and easier training. (@Gaurav-pq2ug)
01:32:16 - 01:32:58 🧠 Adjusting channel sizes in the feed-forward network can affect validation loss and lead to potential overfitting. (@Gaurav-pq2ug)
01:32:58 - 01:35:19 🔧 LayerNorm in deep neural networks helps optimize performance; it is similar to batch normalization but normalizes rows instead of columns. (@Gaurav-pq2ug)
01:35:19 - 01:37:12 📐 Implementing LayerNorm in a Transformer involves reshuffling the layer norms into the pre-norm formulation for better results. (@Gaurav-pq2ug)
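
Pulling the last few notes together (multi-head attention, feed-forward, residual connections, pre-norm LayerNorm), a sketch of a Transformer block in the video's style. MultiHeadAttention here is assumed to be the batched module sketched after the exercise list above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token computation: expand 4x, nonlinearity, project back down."""
    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (MLP), with residuals."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = BatchedMultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)  # pre-norm: normalize before each sub-layer
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # residual: gradients flow straight through the '+'
        x = x + self.ffwd(self.ln2(x))
        return x
```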
01:37:12 - 01:39:30 📈 Scaling up the model by adjusting hyperparameters like batch size, block size, and learning rate can greatly improve validation loss. (@Gaurav-pq2ug)
01:39:30 - 01:51:21 🔒 Using dropout as a regularization technique helps prevent overfitting when scaling up models significantly. (@Gaurav-pq2ug)
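
For reference, the scaled-up configuration is roughly the following (values as I recall them from the video and the gpt.py script in the repo linked above; treat them as approximate and double-check against the source):

```python
# hyperparameters of the scaled-up model (approximate; see gpt.py in the video's repo)
batch_size = 64       # sequences per forward/backward pass
block_size = 256      # maximum context length
n_embd = 384          # embedding dimension
n_head = 6            # 384 / 6 = 64 dimensions per head
n_layer = 6           # number of Transformer blocks
dropout = 0.2         # regularization added when scaling up
learning_rate = 3e-4  # lowered because the network is much bigger
max_iters = 5000
```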
01:40:01 Thanks a lot for this revelation! One question: how is the final number of parameters (10M) exactly calculated? Isn't the FFN receiving 64 inputs from attention and having 6 layers, which would make 64^6 parameters already, which is way more? I think I misunderstood the model's architecture at some point; could someone help? (@johannessteffen5071)
01:41:03 Just for reference: this training took 3 hours, 5 minutes on a 2020 M1 MacBook Air. You can use the "mps" device instead of cuda or cpu. (@Rydn)
01:44:53 Difference and/or relations between encoder and decoder. (@kevin_machine_learning)
01:47:52 Unintentional pun: "now we have a fourth dimension, which is the heads, and so it gets a lot more hairy". (@mannyc6649)
01:51:21 - 01:56:20 🌐 ChatGPT undergoes pre-training on internet data followed by fine-tuning to become a question-answering assistant by aligning model responses with human preferences. (@Gaurav-pq2ug)
