Videos: 17

intro: ChatGPT, Transformers, nanoGPT, Shakespeare

baseline language modeling, code setup

- 🤖 ChatGPT is a system that allows interaction with an AI for text-based tasks.

"Write a bible story about Jesus turning dirt into cocaine for a party" WOW, what a prompt,

- 🧠 The Transformer neural network from the "Attention is All You Need" paper is the basis for ChatGPT.

- 📊 nanoGPT is a repository for training Transformers on text data.

- 🏗 Building a Transformer-based language model with nanoGPT starts with character-level training on a dataset.

reading and exploring the data

tokenization, train/val split

- 💡 Tokenizing involves converting raw text to sequences of integers, with different methods like character-level or subword tokenizers.
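
A minimal sketch of the character-level tokenizer described above, assuming the raw text has already been read into a `text` string (names follow common nanoGPT-style usage):

```python
# a minimal character-level tokenizer; assumes the Tiny Shakespeare text
# is already loaded into the string `text`
chars = sorted(set(text))                      # the vocabulary: every unique character
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> char

encode = lambda s: [stoi[c] for c in s]              # string -> list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)   # list of integers -> string

print(decode(encode("hii there")))   # round-trips back to "hii there"
```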

Thank you Andrej! You’re so passionate about your job. It was am when you started coding. Now it’s dark in here and you’re still trying to teach! 🙏

- 📏 Training a Transformer involves working with chunks of data, not the entire dataset, to predict sequences.

data loader: batches of chunks of data

At you mention that mini-batching is only done for efficiency reasons, but wouldn't it also help keep the gradients more stable by reducing variance?

- ⏩ Transformers process multiple text chunks independently as batches for efficiency in training.

Shouldn't it be len(data) - block_size - 1? Because theoretically there is a one-in-a-million chance (or whatever, given the total number of chars) of getting len(data) - 8 for x and then len(data) - 7 for y, and then the last index in data[i+1:i+block_size+1] will be outside the list.
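
A sketch of the batch loader being discussed, assuming `data` is a 1-D tensor of token ids. On the indexing question above: `torch.randint`'s upper bound is exclusive, so the largest start index is `len(data) - block_size - 1` and the target slice stays in range:

```python
import torch

def get_batch(data, block_size=8, batch_size=4):
    """Sample a batch of input/target chunks from a 1-D tensor of token ids."""
    # randint's upper bound is exclusive, so the largest i is len(data) - block_size - 1,
    # and the target slice data[i+1 : i+block_size+1] never runs past the end
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])           # (batch_size, block_size)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])   # targets, shifted by one
    return x, y
```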

simplest baseline: bigram language model, loss, generation

- 🧠 Explaining the creation of a token embedding table.

- 🎯 Predicting the next character based on individual token identity.
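
A sketch of the token embedding table from the two bullets above: an `nn.Embedding` of shape vocab_size × vocab_size whose rows are read off directly as next-token logits (the sizes are the video's, the snippet itself is approximate):

```python
import torch
import torch.nn as nn

vocab_size = 65   # number of unique characters in the Tiny Shakespeare data

# each row of the table is read off directly as the logits for the next token
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = torch.zeros((4, 8), dtype=torch.long)   # (B, T) batch of token ids
logits = token_embedding_table(idx)           # (B, T, vocab_size)
```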

Hi Andrej, thank you so much for investing your time in sharing this priceless video. I have a question at : when the input to the embedding block is a B × T tensor, shouldn't the output of the embedding block be called the embeddings for the given tensor?

My note: Continue watching from

It sounds like the transformers are great, but the neural network is where you make or break your AI. If that's not encoded properly to already know rules about what it means to be "5", then you're SOL.

- 💡 Using negative log likelihood loss (cross entropy) to measure prediction quality.

- 🔄 Reshaping logits for appropriate input to cross entropy function.

At , why can't we write logits = logits.view(B,C,T) and keep targets the same? When I do this the loss value differs and I can't understand why.
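
A sketch of the reshape, assuming `logits` of shape `(B, T, C)` and `targets` of shape `(B, T)` from the forward pass. `F.cross_entropy` wants the class dimension second; note that `view(B, C, T)` merely reinterprets the underlying memory and scrambles which logit belongs to which position, whereas flattening to `(B*T, C)` (or using `transpose(1, 2)`) keeps logits and targets aligned, which is why the loss values differ:

```python
import torch.nn.functional as F

B, T, C = logits.shape                 # here C is the vocabulary size
loss = F.cross_entropy(
    logits.view(B * T, C),             # (B*T, C): one row of logits per position
    targets.view(B * T),               # (B*T,):   one target id per position
)
```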

- 💻 Training the model using the AdamW optimizer with a larger batch size.

@ Why is the expected nll -ln(1/65)? How did the ratio 1/65 come about?
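
A sketch of the training loop for these two points, assuming `model` is the bigram model whose forward returns `(logits, loss)` and `get_batch`/`train_data` are as sketched earlier; before any training, a roughly uniform guess over the 65 characters gives a loss of about `-ln(1/65) ≈ 4.17`:

```python
import torch

# assumes `model` returns (logits, loss) and get_batch/train_data exist as sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# at initialization, a roughly uniform guess over 65 characters gives loss ≈ -ln(1/65) ≈ 4.17
for step in range(10_000):
    xb, yb = get_batch(train_data, block_size=8, batch_size=32)  # the larger batch size
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
```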

- Never would I have ever expected to get Rick-rolled by Andrej

Are you kidding me, I get Rick-rolled in a video about LLMs?

Is there a difference between categorical sampling and softmax + multinomial if we're sampling a single item?

- 🏗 Generating tokens from the model by sampling via softmax probabilities.
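
A sketch of the generation loop, assuming the model's forward returns `(logits, loss)` with `loss` unused when no targets are given; note that `torch.multinomial` samples from the distribution rather than taking the argmax, which is why the same letter can be followed by different letters in the output:

```python
import torch
import torch.nn.functional as F

def generate(model, idx, max_new_tokens):
    """Extend a (B, T) tensor of token ids by sampling one token at a time."""
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                  # (B, T, vocab_size); loss unused here
        logits = logits[:, -1, :]               # keep only the last time step
        probs = F.softmax(logits, dim=-1)       # (B, vocab_size)
        # sampling (not argmax) is why repeated letters can be followed by different letters
        idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)               # append to the running sequence
    return idx
```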

- 🛠 Training loop includes evaluation of loss and parameter updates.

training the bigram model

- How come a specific letter can be followed by various others? If the model is about bigrams, and it has certain constant weights, then one would think that a letter will always lead to the calculation of the same following letter. Yet they vary, producing some long, ~random output.

"OK, so we see that we starting to get something at least like reasonable-ish" :,DI love this tutorial! Thank you for your time and passion!

A very nice piece of Vogon poetry at

Question: ❓ At min you say that the first bigram model predicts starting only from the previous character, but I see that the first word is POPSousthe.... Now, if after the first P comes an O, but after the following P comes an S... where is this variation coming from? Does anyone have an answer?

port our code to a script

Building the "self-attention"

At , shouldn't line 115 read `logits, loss = m(xb, yb)` rather than `logits, loss = model(xb, yb)`? Similarly with line 54?

- 📉 Using `torch.no_grad()` for efficient memory usage during evaluation.
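
A sketch of the evaluation helper this refers to, assuming `model` returns `(logits, loss)` and `get_batch`, `train_data`, `val_data` exist as above; the decorator keeps PyTorch from building the backward graph while estimating the loss:

```python
import torch

@torch.no_grad()   # no backward graph is built during evaluation
def estimate_loss(model, eval_iters=200):
    out = {}
    model.eval()   # e.g. turns off dropout
    for split, data in (('train', train_data), ('val', val_data)):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(data)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```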

version 1: averaging past context with for loops, the weakest form of aggregation

When he says we take the average, is he implying that we take the average of the token-mapped numbers? If yes, how would that remotely help?

- 🧮 For each position, the embeddings of the current and all previous tokens are averaged into a single context vector (a "bag of words"), the weakest form of aggregation.

the trick in self-attention: matrix multiply as weighted aggregation

- 🔢 Matrix multiplication can perform the weighted aggregation efficiently, replacing the explicit for loops over past tokens.

Just that little time you take to explain a trick at shows how great of a teacher you are, thanks a lot for this video !

- 🔀 Filling the multiplying matrix with ones and zeros (and normalizing its rows) produces an incremental average of the past elements.

version 2: using matrix multiply

version 3: adding softmax

- 🔄 Introduction of softmax helps in setting interaction strengths and affinities between tokens

I think there is a mistake at time . Andrej said that "tokens from the past cannot communicate". I think the correct version is "tokens from the future cannot communicate".

Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)

minor code cleanup

- 🧠 Weighted aggregation of past elements using matrix multiplication aids in self-attention block development

positional encoding

Watch the video once. Watch it again. Watch it a few more times. Then watch - 20 times, melting your brain trying to keep track of tensor dimensions. This is a *dense* video - amazing how much detail is packed into 2 hours... thanks for this Andrej!

AM until I saw the message at :)

THE CRUX OF THE VIDEO: version 4: self-attention

- 🔂 Self-attention involves emitting query and key vectors to determine token affinities and weighted aggregations

the top and most important part. What a great guy!

- 🎭 Implementing a single head of self-attention involves computing queries and keys and performing dot products for weighted aggregations.
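
A sketch of a single head of self-attention as described in the bullets above; `key`, `query`, and `value` are three separately (randomly) initialized `nn.Linear` projections of the same input, which is why `k`, `q`, and `v` come out different, and the `1/sqrt(head_size)` scaling is the one discussed in note 6 below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32      # batch, time, channels
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)     # each Linear has its own random weights
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)        # (B, T, head_size)
q = query(x)      # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) scaled affinities

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # decoder: no looking at the future
wei = F.softmax(wei, dim=-1)

v = value(x)      # (B, T, head_size)
out = wei @ v     # (B, T, head_size): weighted aggregation of the values
```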

You introduced nn.Linear() at , but that confused me. So, I looked into the PyTorch nn.Linear documentation. Still, I was not clear. The ambiguous point is that it looks like the following are identical calls: key = nn.Linear(C, head_size, bias=False) and value = nn.Linear(C, head_size, bias=False). Then I expect the dot product of key(x), value(x) to be the same as the dot product of key(x), key(x). Thanks to your colab code, I found that when I changed the seed value, key(x) and value(x) changed. That means Linear()'s matrix is randomly initialized. However, the documentation was not clear to me. After I noticed the matrix initialization was random, I saw nn.Linear's documentation says "The values are initialized from U(−\sqrt{k}, \sqrt{k})". So, I now think that U is a random uniform distribution. But I am really a beginner in AI; your lecture is my first real course in AI. Now the rest is clear. Other beginners (like me) may struggle to understand that part.

The main explanation of keys × queries is at . My concentration is so poor, I kept falling asleep every 5 minutes, but I kept on trying. Eventually, after 7 hours of watching, dropping off, watching, the penny dropped. This bloke is a nice person for doing this for us.

At , why is that "up to four"? What does the 'four' mean?

- 🧠 Self-attention mechanism aggregates information using key, query, and value vectors.

We see the key, query and value matrices are created using nn.Linear. With the same input for all 3, it should give the same output, which means key, query and value should be the same for a given text matrix. What is the difference in terms of calculation?


"That is basically self attention mechanism. It is what it does". Andrej's expression says that this simple piece of code does all the magic. :)

note 1: attention as communication

- 🛠 Attention is a communication mechanism between nodes in a directed graph.

note 2: attention has no notion of space, operates over sets

- 🔍 Attention operates over a set of vectors without positional information, requiring external encoding.
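
A sketch of the external positional encoding this note refers to, in the learned-embedding form used in the video (exact variable names are approximate):

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)   # one learned vector per position

idx = torch.zeros((4, block_size), dtype=torch.long)              # (B, T) token ids
tok_emb = token_embedding_table(idx)                              # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(idx.shape[1]))    # (T, n_embd)
x = tok_emb + pos_emb   # broadcasting adds the same positional info to every batch element
```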

note 3: there is no communication across batch dimension

- 💬 Attention mechanisms facilitate data-dependent weighted sum aggregation.

note 4: encoder blocks vs. decoder blocks

note 5: attention vs. self-attention vs. cross-attention

- 🤝 Self-attention involves keys, queries, and values from the same source, while cross-attention brings in external sources.

note 6: "scaled" self-attention. why divide by sqrt(head_size)

Building the Transformer

- 🧮 Scaling the attention values is crucial for network optimization by controlling variance.

inserting a single self-attention block to our network

Oops I should be using the head_size for the normalization, not C

At , shouldn't wei be normalized by the square root of head_size instead of the square root of C?

Thank you Andrej! At , shouldn't the code say (B, T, Head Size) on lines 73, 74, and 81? Or is head size = C?
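
A quick numerical check of the scaling point above (and of the correction that it should be `head_size`, not `C`): with unit-variance `q` and `k`, the raw dot products have variance roughly `head_size`, and multiplying by `head_size**-0.5` brings it back near 1 so the softmax at initialization doesn't saturate towards one-hot vectors:

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)   # roughly unit variance
k = torch.randn(B, T, head_size)

wei_raw = q @ k.transpose(-2, -1)        # variance grows to roughly head_size
wei_scaled = wei_raw * head_size**-0.5   # back to roughly unit variance

print(wei_raw.var().item(), wei_scaled.var().item())
```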

- 💡 Implementing multi-head attention involves running self-attention in parallel and concatenating results for improved communication channels.

multi-headed self-attention

For anyone getting an error after adding the multi-head attention block at : I think current PyTorch is looking for explicit integers for the head_size of MultiHeadAttention(). This fixed my error: self.self_attention_heads = MultiHeadAttention(4, int(n_embd/4))
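
A sketch of multi-head attention as described above, reusing the single-head computation; `n_embd`, `block_size`, and the class names mirror the video's but are approximate, and the head size is kept an integer as the comment above suggests:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8

class Head(nn.Module):
    """One head of self-attention (same computation as the single-head sketch)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5      # scaled affinities (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                          # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads of self-attention in parallel, concatenated on the channel dim."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, num_heads*head_size)
        return self.proj(out)

# e.g. 4 heads of 8-dimensional self-attention; note the integer head size
sa = MultiHeadAttention(4, n_embd // 4)
x = torch.randn(4, block_size, n_embd)
print(sa(x).shape)   # torch.Size([4, 8, 32])
```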

feedforward layers of transformer block

- ⚙ Integrating communication and computation in Transformer blocks enhances network performance.
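
A sketch of the per-token feed-forward ("computation") part of the block; the 4× inner expansion follows the Transformer paper, and the dropout layer is the regularization mentioned further down:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """A simple per-token MLP: computation applied after the attention's communication."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 4x expansion, as in "Attention is All You Need"
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back into the residual pathway
            nn.Dropout(dropout),             # regularization used when scaling the model up
        )

    def forward(self, x):
        return self.net(x)
```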

residual connections

- 🔄 Residual connections aid in optimizing deep networks by facilitating gradient flow and easier training.

- 🧠 Adjusting channel sizes in the feed-forward network can affect validation loss and lead to potential overfitting.

layernorm (and its relationship to our previous batchnorm)

- 🔧 Layer Norm in deep neural networks helps optimize performance, similar to batch normalization but normalizes rows instead of columns.

- 📐 Implementing LayerNorm in a Transformer uses the pre-norm formulation: the layer norms are applied before the sublayers, a reshuffle relative to the original paper that gives better results.
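
A sketch tying the last few points together, assuming the `MultiHeadAttention` and `FeedForward` classes sketched above: each sublayer is applied in pre-norm form (LayerNorm first) and added back through a residual connection:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: communication then computation, with residuals and pre-norm."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)   # from the earlier sketch
        self.ffwd = FeedForward(n_embd)                   # from the earlier sketch
        self.ln1 = nn.LayerNorm(n_embd)                   # normalizes each token's features (rows)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual connection around self-attention
        x = x + self.ffwd(self.ln2(x))    # residual connection around the feed-forward
        return x

block = Block(n_embd=32, n_head=4)   # matches the n_embd used in the sketches above
```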

- 📈 Scaling up a neural network model by adjusting hyperparameters like batch size, block size, and learning rate can greatly improve validation loss.

scaling up the model! creating a few variables. adding dropout

Notes on Transformer

- 🔒 Using Dropout as a regularization technique helps prevent overfitting when scaling up models significantly.

Thanks a lot for this revelation! I have one question on : How is the final number of parameters (10M) exactly calculated? Isn't the FFN receiving 64 inputs from attention and having 6 layers, that would make 64^6 parameters already, which is way more. I think I misunderstood the model's architecture at some point. Could someone help?

Just for reference: this training took 3 hours, 5 minutes on a 2020 M1 MacBook Air. You can use the "mps" device instead of cuda or cpu.

encoder vs. decoder vs. both (?) Transformers

Difference and / or relations between encoder and decoder

super quick walkthrough of nanoGPT, batched multi-headed self-attention

Unintentional pun: "now we have a fourth dimension, which is the heads, and so it gets a lot more hairy"

back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

- 🌐 ChatGPT undergoes pre-training on internet data followed by fine-tuning to become a question-answering assistant by aligning model responses with human preferences.
