Videos: 17

intro: ChatGPT, Transformers, nanoGPT, Shakespeare

baseline language modeling, code setup

- 🤖 ChatGPT is a system that allows interaction with an AI for text-based tasks.

"Write a bible story about Jesus turning dirt into cocaine for a party" WOW, what a prompt,

- 🧠 The Transformer neural network from the "Attention is All You Need" paper is the basis for ChatGPT.

- 📊 nanoGPT is a repository for training Transformers on text data.

- 🏗 Building a Transformer-based language model with nanoGPT starts with character-level training on a dataset.

reading and exploring the data

tokenization, train/val split

- 💡 Tokenizing involves converting raw text to sequences of integers, with different methods like character-level or subword tokenizers.
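
A minimal sketch of the character-level tokenizer described above, assuming the raw text has already been read into a `text` string (names follow common nanoGPT-style usage):

```python
# a minimal character-level tokenizer; assumes the Tiny Shakespeare text
# is already loaded into the string `text`
chars = sorted(set(text))                      # the vocabulary: every unique character
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> char

encode = lambda s: [stoi[c] for c in s]              # string -> list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)   # list of integers -> string

print(decode(encode("hii there")))   # round-trips back to "hii there"
```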

Thank you Andrej! You’re so passionate about your job. It was am when you started coding. Now it’s dark in here and you’re still trying to teach! 🙏

- 📏 Training a Transformer involves working with chunks of data, not the entire dataset, to predict sequences.

data loader: batches of chunks of data

At you mention that mini-batching is only done for efficiency reasons, but wouldn't it also help keep the gradients more stable by reducing variance?

- ⏩ Transformers process multiple text chunks independently as batches for efficiency in training.

Shouldn't it be len(data) - block_size - 1? Because theoretically there is a one-in-a-million chance (or whatever, given the total number of chars) of getting len(data) - 8 for x and then len(data) - 7 for y, and then the last index in data[i+1:i+block_size+1] will be outside the list.
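
A sketch of the batch loader being discussed, assuming `data` is a 1-D tensor of token ids. On the indexing question above: `torch.randint`'s upper bound is exclusive, so the largest start index is `len(data) - block_size - 1` and the target slice stays in range:

```python
import torch

def get_batch(data, block_size=8, batch_size=4):
    """Sample a batch of input/target chunks from a 1-D tensor of token ids."""
    # randint's upper bound is exclusive, so the largest i is len(data) - block_size - 1,
    # and the target slice data[i+1 : i+block_size+1] never runs past the end
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])           # (batch_size, block_size)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])   # targets, shifted by one
    return x, y
```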

simplest baseline: bigram language model, loss, generation

- 🧠 Explaining the creation of a token embedding table.

- 🎯 Predicting the next character based on individual token identity.
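
A sketch of the token embedding table from the two bullets above: an `nn.Embedding` of shape vocab_size × vocab_size whose rows are read off directly as next-token logits (the sizes are the video's, the snippet itself is approximate):

```python
import torch
import torch.nn as nn

vocab_size = 65   # number of unique characters in the Tiny Shakespeare data

# each row of the table is read off directly as the logits for the next token
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = torch.zeros((4, 8), dtype=torch.long)   # (B, T) batch of token ids
logits = token_embedding_table(idx)           # (B, T, vocab_size)
```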

Hi Andrej, thank you so much for investing your time in sharing this priceless video. I have a question at : when the input to the embedding block is a B × T tensor, shouldn't the output of the embedding block be called the embeddings for the given tensor?

My note: Continue watching from

It sounds like the transformers are great, but the neural network is where you make or break your AI. If that's not encoded properly to already know rules about what it means to be "5", then you're SOL.

- 💡 Using negative log likelihood loss (cross entropy) to measure prediction quality.

- 🔄 Reshaping logits for appropriate input to cross entropy function.

At , why can't we write logits = logits.view(B,C,T) and keep targets the same? When I do this the loss value differs and I can't understand why.
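
A sketch of the reshape, assuming `logits` of shape `(B, T, C)` and `targets` of shape `(B, T)` from the forward pass. `F.cross_entropy` wants the class dimension second; note that `view(B, C, T)` merely reinterprets the underlying memory and scrambles which logit belongs to which position, whereas flattening to `(B*T, C)` (or using `transpose(1, 2)`) keeps logits and targets aligned, which is why the loss values differ:

```python
import torch.nn.functional as F

B, T, C = logits.shape                 # here C is the vocabulary size
loss = F.cross_entropy(
    logits.view(B * T, C),             # (B*T, C): one row of logits per position
    targets.view(B * T),               # (B*T,):   one target id per position
)
```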

- 💻 Training the model using the AdamW optimizer with a larger batch size.

@ Why is the expected nll -ln(1/65)? How did the ratio 1/65 come about?
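
A sketch of the training loop for these two points, assuming `model` is the bigram model whose forward returns `(logits, loss)` and `get_batch`/`train_data` are as sketched earlier; before any training, a roughly uniform guess over the 65 characters gives a loss of about `-ln(1/65) ≈ 4.17`:

```python
import torch

# assumes `model` returns (logits, loss) and get_batch/train_data exist as sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# at initialization, a roughly uniform guess over 65 characters gives loss ≈ -ln(1/65) ≈ 4.17
for step in range(10_000):
    xb, yb = get_batch(train_data, block_size=8, batch_size=32)  # the larger batch size
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
```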

- Never would I have ever expected to get Rick-rolled by Andrej

Are you kidding me, I get Rick-rolled in a video about LLMs?

Is there a difference between categorical sampling and softmax + multinomial if we're sampling a single item?

- 🏗 Generating tokens from the model by sampling via softmax probabilities.
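
A sketch of the generation loop, assuming the model's forward returns `(logits, loss)` with `loss` unused when no targets are given; note that `torch.multinomial` samples from the distribution rather than taking the argmax, which is why the same letter can be followed by different letters in the output:

```python
import torch
import torch.nn.functional as F

def generate(model, idx, max_new_tokens):
    """Extend a (B, T) tensor of token ids by sampling one token at a time."""
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                  # (B, T, vocab_size); loss unused here
        logits = logits[:, -1, :]               # keep only the last time step
        probs = F.softmax(logits, dim=-1)       # (B, vocab_size)
        # sampling (not argmax) is why repeated letters can be followed by different letters
        idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)               # append to the running sequence
    return idx
```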

- 🛠 Training loop includes evaluation of loss and parameter updates.

training the bigram model

- How come a specific letter can be followed by various others? If the model is about bigrams, and it has certain constant weights, then one would think that a letter will always lead to the calculation of the same following letter. Yet they vary, producing some long, ~random output.

"OK, so we see that we starting to get something at least like reasonable-ish" :,DI love this tutorial! Thank you for your time and passion!

A very nice piece of Vogon poetry at

Question: ❓ At min you say that the first bigram model predicts starting only from the previous character, but I see that the first word is POPSousthe.... Now, if after the first P comes an O, but after the following P comes an S... where is this variation coming from? Does anyone have an answer?

port our code to a script

Building the "self-attention"

At , shouldn't line 115 read `logits, loss = m(xb, yb)` rather than `logits, loss = model(xb, yb)`? Similarly with line 54?

- 📉 Using `torch.no_grad()` for efficient memory usage during evaluation.
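
A sketch of the evaluation helper this refers to, assuming `model` returns `(logits, loss)` and `get_batch`, `train_data`, `val_data` exist as above; the decorator keeps PyTorch from building the backward graph while estimating the loss:

```python
import torch

@torch.no_grad()   # no backward graph is built during evaluation
def estimate_loss(model, eval_iters=200):
    out = {}
    model.eval()   # e.g. turns off dropout
    for split, data in (('train', train_data), ('val', val_data)):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(data)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```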

version 1: averaging past context with for loops, the weakest form of aggregation

When he says we take the average, is he implying that we take the average of the token-mapped numbers? If yes, how would that remotely help?

- 🧮 For each position, the embeddings of the current and all previous tokens are averaged into a single context vector (a "bag of words"), the weakest form of aggregation.

the trick in self-attention: matrix multiply as weighted aggregation

- 🔢 Matrix multiplication can perform the weighted aggregation efficiently, replacing the explicit for loops over past tokens.

Just that little time you take to explain a trick at shows how great of a teacher you are, thanks a lot for this video !

- 🔀 Filling the multiplying matrix with ones and zeros (and normalizing its rows) produces an incremental average of the past elements.

version 2: using matrix multiply

version 3: adding softmax

- 🔄 Introduction of softmax helps in setting interaction strengths and affinities between tokens

I think there is a mistake at time . Andrej said that "tokens from the past cannot communicate". I think the correct version is "tokens from the future cannot communicate".

Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)

minor code cleanup

- 🧠 Weighted aggregation of past elements using matrix multiplication aids in self-attention block development

positional encoding

Watch the video once. Watch it again. Watch it a few more times. Then watch - 20 times, melting your brain trying to keep track of tensor dimensions. This is a *dense* video - amazing how much detail is packed into 2 hours... thanks for this Andrej!

AM until I saw the message at :)

THE CRUX OF THE VIDEO: version 4: self-attention

- 🔂 Self-attention involves emitting query and key vectors to determine token affinities and weighted aggregations

the top and most important part. What a great guy!

- 🎭 Implementing a single head of self-attention involves computing queries and keys and performing dot products for weighted aggregations.
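
A sketch of a single head of self-attention as described in the bullets above; `key`, `query`, and `value` are three separately (randomly) initialized `nn.Linear` projections of the same input, which is why `k`, `q`, and `v` come out different, and the `1/sqrt(head_size)` scaling is the one discussed in note 6 below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32      # batch, time, channels
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)     # each Linear has its own random weights
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)        # (B, T, head_size)
q = query(x)      # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) scaled affinities

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # decoder: no looking at the future
wei = F.softmax(wei, dim=-1)

v = value(x)      # (B, T, head_size)
out = wei @ v     # (B, T, head_size): weighted aggregation of the values
```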

You introduced nn.Linear() at , but that confused me. So, I looked into the PyTorch nn.Linear documentation. Still, I was not clear. The ambiguous point is that it looks like the following are identical calls: key = nn.Linear(C, head_size, bias=False) and value = nn.Linear(C, head_size, bias=False). Then I expect the dot product of key(x), value(x) to be the same as the dot product of key(x), key(x). Thanks to your colab code, I found that when I changed the seed value, key(x) and value(x) changed. That means Linear()'s matrix is randomly initialized. However, the documentation was not clear to me. After I noticed the matrix initialization was random, I saw nn.Linear's documentation says "The values are initialized from U(−\sqrt{k}, \sqrt{k})". So, I now think that U is a random uniform distribution. But I am really a beginner in AI; your lecture is my first real course in AI. Now the rest is clear. Other beginners (like me) may struggle to understand that part.

The main explanation of keys × queries is at . My concentration is so poor, I kept falling asleep every 5 minutes, but I kept on trying. Eventually, after 7 hours of watching, dropping off, watching, the penny dropped. This bloke is a nice person for doing this for us.

At , why is that "up to four"? What does the 'four' mean?

- 🧠 Self-attention mechanism aggregates information using key, query, and value vectors.

We see the key, query and value matrices are created using nn.Linear. With the same input for all 3, it should give the same output, which means key, query and value should be the same for a given text matrix. What is the difference in terms of calculation?


"That is basically self attention mechanism. It is what it does". Andrej's expression says that this simple piece of code does all the magic. :)

note 1: attention as communication

- 🛠 Attention is a communication mechanism between nodes in a directed graph.

note 2: attention has no notion of space, operates over sets

- 🔍 Attention operates over a set of vectors without positional information, requiring external encoding.
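
A sketch of the external positional encoding this note refers to, in the learned-embedding form used in the video (exact variable names are approximate):

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)   # one learned vector per position

idx = torch.zeros((4, block_size), dtype=torch.long)              # (B, T) token ids
tok_emb = token_embedding_table(idx)                              # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(idx.shape[1]))    # (T, n_embd)
x = tok_emb + pos_emb   # broadcasting adds the same positional info to every batch element
```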

note 3: there is no communication across batch dimension

- 💬 Attention mechanisms facilitate data-dependent weighted sum aggregation.

note 4: encoder blocks vs. decoder blocks

note 5: attention vs. self-attention vs. cross-attention

- 🤝 Self-attention involves keys, queries, and values from the same source, while cross-attention brings in external sources.

note 6: "scaled" self-attention. why divide by sqrt(head_size)

Building the Transformer

- 🧮 Scaling the attention values is crucial for network optimization by controlling variance.

inserting a single self-attention block to our network

Oops I should be using the head_size for the normalization, not C

At , shouldn't wei be normalized by the square root of head_size instead of the square root of C?

Thank you Andrej! At , shouldn't the code say (B, T, Head Size) on lines 73, 74, and 81? Or is head size = C?
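
A quick numerical check of the scaling point above (and of the correction that it should be `head_size`, not `C`): with unit-variance `q` and `k`, the raw dot products have variance roughly `head_size`, and multiplying by `head_size**-0.5` brings it back near 1 so the softmax at initialization doesn't saturate towards one-hot vectors:

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)   # roughly unit variance
k = torch.randn(B, T, head_size)

wei_raw = q @ k.transpose(-2, -1)        # variance grows to roughly head_size
wei_scaled = wei_raw * head_size**-0.5   # back to roughly unit variance

print(wei_raw.var().item(), wei_scaled.var().item())
```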

- 💡 Implementing multi-head attention involves running self-attention in parallel and concatenating results for improved communication channels.

multi-headed self-attention

For anyone getting an error after adding the multi-head attention block at : I think current PyTorch is looking for explicit integers for the head_size of MultiHeadAttention(). This fixed my error: self.self_attention_heads = MultiHeadAttention(4, int(n_embd/4))
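
A sketch of multi-head attention as described above, reusing the single-head computation; `n_embd`, `block_size`, and the class names mirror the video's but are approximate, and the head size is kept an integer as the comment above suggests:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8

class Head(nn.Module):
    """One head of self-attention (same computation as the single-head sketch)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5      # scaled affinities (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                          # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads of self-attention in parallel, concatenated on the channel dim."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, num_heads*head_size)
        return self.proj(out)

# e.g. 4 heads of 8-dimensional self-attention; note the integer head size
sa = MultiHeadAttention(4, n_embd // 4)
x = torch.randn(4, block_size, n_embd)
print(sa(x).shape)   # torch.Size([4, 8, 32])
```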

feedforward layers of transformer block

- ⚙ Integrating communication and computation in Transformer blocks enhances network performance.
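
A sketch of the per-token feed-forward ("computation") part of the block; the 4× inner expansion follows the Transformer paper, and the dropout layer is the regularization mentioned further down:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """A simple per-token MLP: computation applied after the attention's communication."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 4x expansion, as in "Attention is All You Need"
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back into the residual pathway
            nn.Dropout(dropout),             # regularization used when scaling the model up
        )

    def forward(self, x):
        return self.net(x)
```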

residual connections

- 🔄 Residual connections aid in optimizing deep networks by facilitating gradient flow and easier training.

- 🧠 Adjusting channel sizes in the feed-forward network can affect validation loss and lead to potential overfitting.

layernorm (and its relationship to our previous batchnorm)

- 🔧 Layer Norm in deep neural networks helps optimize performance, similar to batch normalization but normalizes rows instead of columns.

- 📐 Implementing LayerNorm in a Transformer uses the pre-norm formulation: the layer norms are applied before the sublayers, a reshuffle relative to the original paper that gives better results.
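
A sketch tying the last few points together, assuming the `MultiHeadAttention` and `FeedForward` classes sketched above: each sublayer is applied in pre-norm form (LayerNorm first) and added back through a residual connection:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: communication then computation, with residuals and pre-norm."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)   # from the earlier sketch
        self.ffwd = FeedForward(n_embd)                   # from the earlier sketch
        self.ln1 = nn.LayerNorm(n_embd)                   # normalizes each token's features (rows)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual connection around self-attention
        x = x + self.ffwd(self.ln2(x))    # residual connection around the feed-forward
        return x

block = Block(n_embd=32, n_head=4)   # matches the n_embd used in the sketches above
```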

- 📈 Scaling up a neural network model by adjusting hyperparameters like batch size, block size, and learning rate can greatly improve validation loss.

scaling up the model! creating a few variables. adding dropout

Notes on Transformer

- 🔒 Using Dropout as a regularization technique helps prevent overfitting when scaling up models significantly.

Thanks a lot for this revelation! I have one question on : How is the final number of parameters (10M) exactly calculated? Isn't the FFN receiving 64 inputs from attention and having 6 layers, that would make 64^6 parameters already, which is way more. I think I misunderstood the model's architecture at some point. Could someone help?

Just for reference: this training took 3 hours, 5 minutes on a 2020 M1 MacBook Air. You can use the "mps" device instead of cuda or cpu.

encoder vs. decoder vs. both (?) Transformers

Difference and / or relations between encoder and decoder

super quick walkthrough of nanoGPT, batched multi-headed self-attention

Unintentional pun: "now we have a fourth dimension, which is the heads, and so it gets a lot more hairy"

back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

- 🌐 ChatGPT undergoes pre-training on internet data followed by fine-tuning to become a question-answering assistant by aligning model responses with human preferences.
