
intro

[<1809.89it/s] Last Loss: 2.403459072113037, Best Loss: 1.4457638263702393 at Epoch: 25480

PS. At , I was just uber curious about his previous searches, so I googled them:

Bengio et al. 2003 (MLP language model) paper walkthrough

Why is the space small? Even in two-dimensional space you can place an infinite number of points.

(re-)building our training dataset

implementing the embedding lookup table

Every time I think I finally understand what's happening, he does something like this: 😅

implementing the hidden layer + internals of torch.Tensor: storage, views

-dimensional vertically scrollable space to describe the functions of PyTorch ()

At , I think it's supposed to be the first letter, not the first word. It's the first word in the paper but the first letter in the example.

At , when he says "words", does he mean the 3-character sequence that was made by the block size? And so, when he refers to the picture behind him, does he mean each of those three blocks represents an index in the block_size context array?
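In case it helps with this question, here is a minimal sketch of how a block_size=3 dataset like the lecture's can be built; each row of X holds the three previous character indices, so the three "blocks" in the diagram correspond to those three positions. The tiny `words` list and `stoi` mapping below are stand-ins just to make it runnable, since the video builds them from names.txt:

```python
import torch

# Stand-in vocabulary and word list; the lecture builds these from names.txt.
words = ["emma", "olivia", "ava"]
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0  # '.' doubles as start padding and end-of-word token

block_size = 3  # context length: each training example is 3 previous characters
X, Y = [], []
for w in words:
    context = [0] * block_size            # e.g. ['.', '.', '.'] before the first character
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)                 # input: indices of the 3 previous characters
        Y.append(ix)                      # label: index of the character that follows
        context = context[1:] + [ix]      # slide the context window by one character

X, Y = torch.tensor(X), torch.tensor(Y)   # X: (num_examples, 3), Y: (num_examples,)
print(X.shape, Y.shape)
```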
what about just `emb_reshaped = emb.reshape((emb.shape[0], emb.shape[1]*emb.shape[2]))` ?
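For what it's worth, a quick sketch suggesting that this reshape, the `view(..., -1)` used in the video, and an explicit concatenation like the one discussed there all produce the same flattened tensor. Shapes follow the lecture's block_size=3, 2-dimensional embedding; the values are random just to make the check runnable:

```python
import torch

emb = torch.randn(32, 3, 2)  # (batch, block_size, embedding_dim), random stand-in data

a = emb.reshape(emb.shape[0], emb.shape[1] * emb.shape[2])  # explicit reshape
b = emb.view(emb.shape[0], -1)                              # view, -1 infers 3*2 = 6
c = torch.cat(torch.unbind(emb, dim=1), dim=1)              # concatenation approach

print(a.shape, torch.equal(a, b), torch.equal(a, c))  # torch.Size([32, 6]) True True
```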

Of course! Memory itself is a one dimensional "tensor". :D

for the PyTorch internals video (@)

Please create the "entire video about the internals of pytorch" that you mentioned in . And thank you so much for the content, Andrej !!

At the minute mark at the moment, and I gotta say, PyTorch is amazing. So wonderful how easy they make it for devs with those small tricks.



implementing the output layer

We can also use torch.reshape() to get the right shape for W. However, there is a difference between torch.view and torch.reshape. TL;DR: if you just want to reshape tensors, use torch.reshape. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use torch.view.
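A small sketch of the difference being described: `view` never copies and therefore requires the tensor's memory layout to cooperate, while `reshape` silently falls back to a copy when it has to. The tensors here are arbitrary examples:

```python
import torch

x = torch.arange(6)          # contiguous storage: [0, 1, 2, 3, 4, 5]
v = x.view(2, 3)             # works, and shares the same underlying storage
v[0, 0] = 99
print(x[0])                  # tensor(99) -- the view aliases x's memory

t = torch.arange(6).reshape(2, 3).t()   # transposing makes the tensor non-contiguous
# t.view(6) would raise a RuntimeError here, because view never copies
r = t.reshape(6)             # reshape silently makes a contiguous copy instead
print(t.is_contiguous(), r.is_contiguous())  # False True
```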

implementing the negative log likelihood loss

What's tanh?
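For anyone else wondering: tanh is the hyperbolic tangent, a squashing nonlinearity that maps any real number into (-1, 1); it is the activation applied to the hidden layer in this lecture. A tiny illustration:

```python
import torch

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(torch.tanh(x))
# tanh(x) = (e^x - e^-x) / (e^x + e^-x); large magnitudes saturate near -1 or +1
print((torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x)))
```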

"ideally all of these numbers here of course are one because then we are correctly predicting the next character" hmmmmmm it's reasonable to say these numbers are high, put not one, If the probability here is one, that will exclude any chance of other characters having similar context.

summary of the full network

introducing F.cross_entropy and why

Re: using the cross_entropy function around , it sounds like PyTorch takes the derivative of each step of exponentiation and then normalization, instead of simplifying them before taking the derivative. Is that a "soft" limitation of the implementation, in that a procedure could be defined to overcome it, or is there a bit of mathematical intuition needed to understand how to rewrite the function to produce a simpler derivative?
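One way to see what the question is getting at: when softmax and negative log likelihood are fused, the gradient with respect to the logits simplifies to `probs - one_hot(targets)`, so a clean closed form does exist, and the fused F.cross_entropy can use a numerically stable formulation rather than differentiating each intermediate step separately. A quick numerical check on toy data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5, requires_grad=True)   # 4 examples, 5 classes (toy data)
targets = torch.tensor([1, 0, 3, 2])

loss = F.cross_entropy(logits, targets)          # fused softmax + NLL, mean reduction
loss.backward()

# Hand-derived gradient of the mean NLL wrt the logits: (softmax(logits) - one_hot) / N
probs = F.softmax(logits.detach(), dim=1)
manual = (probs - F.one_hot(targets, num_classes=5).float()) / logits.shape[0]

print(torch.allclose(logits.grad, manual, atol=1e-6))  # True
```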

Since probs are invariant to an offset applied to logits, it's fun to plot the drift in the mean or sum of b2. Looks like Brownian motion.
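A quick check of that invariance (arbitrary toy logits): adding the same constant to every logit leaves the softmax unchanged, which is why nothing in the probabilities pins down the mean of b2.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 27)              # arbitrary logits, one row per example
shifted = logits + 5.0                   # same constant offset added to every logit

p1 = F.softmax(logits, dim=1)
p2 = F.softmax(shifted, dim=1)
print(torch.allclose(p1, p2, atol=1e-6)) # True: the probabilities ignore the offset,
                                         # so the mean of b2 is free to drift
```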

, who would tell you this when you are reading from a book? Exceptional teaching ability.

implementing the training loop, overfitting one batch

pfeeeewwww 😳

training on the full dataset, minibatches

I don't understand the mini-batching happening at . When using `ix = torch.randint(0, X.shape[0], (32,))` and using this to index into X, you are just picking 32 data examples from X, not batching all of the data, right? I thought that in batching you take a batch of data, do a forward pass on all items in the batch, take the mean of the resulting losses, and do backprop on that mean result to update the model. Here I feel like Andrej is just selecting 32 individual data examples. Please do correct me if I'm wrong! I'm new to ML!
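For what it's worth, that is essentially what mini-batching means here: each iteration samples 32 examples, the forward pass runs on all 32 at once, and F.cross_entropy (with its default mean reduction) averages the 32 per-example losses into one scalar before backprop. A sketch of a single iteration in that spirit; the dataset and parameter tensors below are random placeholders, not the lecture's:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randint(0, 27, (1000, 3))       # placeholder dataset of 3-character contexts
Y = torch.randint(0, 27, (1000,))
C  = torch.randn(27, 2, requires_grad=True)     # embedding table
W1 = torch.randn(6, 100, requires_grad=True)
b1 = torch.randn(100, requires_grad=True)
W2 = torch.randn(100, 27, requires_grad=True)
b2 = torch.randn(27, requires_grad=True)

ix = torch.randint(0, X.shape[0], (32,))  # sample 32 row indices: one mini-batch
emb = C[X[ix]]                            # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2                      # (32, 27)
loss = F.cross_entropy(logits, Y[ix])     # single scalar: mean loss over the 32 examples
loss.backward()                           # one parameter update is based on this average
```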

life lesson: much better to have an approximate gradient and take many steps than have an exact gradient and take a few steps

Awesome videos, thank you for that! I have a question though about , "finding a good initial learning rate", which is either a mistake in the video or I misunderstood something.

It seems it is slightly different from the approach presented here. Looking at the , it looks like for each iteration we randomly select a mini-batch of size 32 from the whole training set, update the parameters, and then go on to the next iteration.

finding a good initial learning rate

@ 'Finding a good initial learning rate': each learning rate is used just one time. The adjustment of the parameters for one learning rate is based on the parameters already adjusted using the prior, smaller learning rates. I feel that each of the 1,000 learning rate candidates should go through the same number of iterations, and then the losses at the end of those iterations should be compared. Please tell me if I am wrong. Thanks!

I don't quite understand the part about finding a good initial learning rate. Why does the lowest point of the loss indicate the best learning rate? It takes some time for the loss to decrease, right?

On , I was waiting for Karpathy's constant to appear. Thank you for the lecture, Andrej.

At , Andrej says that the learning rate would be low in the beginning and high at the end. Why was it set like that? My intuition is that the learning rate should be in the opposite order.
I believe that at the losses and the learning rates are misaligned. The first loss (derived from completely random weights) is computed before the first learning rate is used, and therefore the first learning rate should be aligned with the second loss. You can simply solve this by using this snippet: `lri = lri[:-1]` and `lossi = lossi[1:]`.

Question about : in the plot the y-axis is the loss and the x-axis is the learning rate, but the x-axis is also the step number. How do you know whether the y-axis change is because of the learning rate difference or the step-number increase?

Great video! One question, @AndrejKarpathy: around or so you show how to graph an optimal learning rate and ultimately you determine that the 0.1 you started with was pretty good. However, unless I'm misunderstanding your code, aren't you iterating over the 1000 different loss function candidates while *simultaneously* doing 1000 consecutive passes over the neural net? Meaning, the loss will naturally be lower during later iterations since you've already done a bunch of backward passes, so the biggest loss improvements would always be stacked towards the beginning of the 1000 iterations, right? Won't that bias your optimal learning rate calculation towards the first few candidates?

Can anyone explain to me why, looking at the loss plotted against the exponent of the learning rate (), the conclusion is that lr < 0.1 "is way too low"? For me, that's where the loss is actually getting lower, isn't it?
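For readers puzzling over the questions above: as these comments describe, the 1,000 candidate rates are spaced exponentially, each rate is used for exactly one step of a single continuing training run, and the loss is plotted against the exponent, which is exactly why the result only gives a rough range rather than an exact optimum. A sketch of that mechanism under those assumptions; the model and data here are random placeholders, not the notebook's:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randint(0, 27, (5000, 3))            # placeholder data, not the names dataset
Y = torch.randint(0, 27, (5000,))
C = torch.randn(27, 2); W1 = torch.randn(6, 100); b1 = torch.randn(100)
W2 = torch.randn(100, 27); b2 = torch.randn(27)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

lre = torch.linspace(-3, 0, 1000)   # exponents, so the rates are spread exponentially
lrs = 10 ** lre                     # candidate rates from 0.001 up to 1.0
lri, lossi = [], []

for i in range(1000):               # ONE training run; each rate gets a single step
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    loss = F.cross_entropy(h @ W2 + b2, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    lri.append(lre[i].item())       # exponent used at step i
    lossi.append(loss.item())       # loss observed at step i, before this update
# Plotting lossi against lri mixes "which rate" with "how far into training",
# which is the caveat several of the comments above are pointing at.
```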

splitting up the dataset into train/val/test splits and why
To break the data into training, development, and test splits, one can also use torch.tensor_split: `n1 = int(0.8 * X.shape[0])`, `n2 = int(0.9 * X.shape[0])`, then `Xtr, Xdev, Xts = X.tensor_split((n1, n2), dim=0)` and `Ytr, Ydev, Yts = Y.tensor_split((n1, n2), dim=0)`.
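In case it helps, a runnable version of that suggestion. The X and Y here are random placeholders; also note that, if I recall the lecture correctly, the words are shuffled before splitting, which tensor_split does not do for you:

```python
import torch

X = torch.randint(0, 27, (1000, 3))   # placeholder dataset
Y = torch.randint(0, 27, (1000,))

n1 = int(0.8 * X.shape[0])            # 80% boundary
n2 = int(0.9 * X.shape[0])            # 90% boundary
Xtr, Xdev, Xts = X.tensor_split((n1, n2), dim=0)
Ytr, Ydev, Yts = Y.tensor_split((n1, n2), dim=0)
print(Xtr.shape[0], Xdev.shape[0], Xts.shape[0])  # 800 100 100
```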

I'm confused about why care must be taken with how many times you can use the test dataset, as the model will learn from it. Is this because there is no equivalent of 'torch.no_grad()' for LLMs - will the LLM always update the weights when given data?

Thank you for the lectures! @ Made me chuckle

It can take days!! How can someone sleep with such pressure


experiment: larger hidden layer

I also just noticed, he explicitly mentions these fluctuations at . Doh!

Around - the reason why we're not "overfitting" with the larger number of params might be the context size. With a context of 3, no number of params will remove the inherent uncertainty.

visualizing the character embeddings

Fascinating how the vowels end up clustered together!

experiment: larger embedding size

: it should be 10-dimensional embeddings for each *character*, not word, in this character-level language model.

You shouldn't have plotted the stepi variable against the loss :D It could have worked if you'd plotted just `plt.plot(loss_history)`, or applied two different colours for those two runs.

The plot of the steps and losses after running the training loop multiple times (~ mins, https://youtu.be/TCH_1BHY58I?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=4233) would be wrong, because the stepi array keeps appending the same indices [0, 50000). I expect the graph to just keep getting more and more unstable.
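If anyone wants to see the effect being described: appending to the same `stepi`/`lossi` lists across two runs makes the x-values wrap back to 0, so the second run is drawn on top of the first rather than after it. A minimal reproduction with a fake decaying loss:

```python
import matplotlib.pyplot as plt
import torch

stepi, lossi = [], []
for run in range(2):                       # simulate re-running the training cell twice
    for i in range(1000):                  # stand-in for the 50,000 real steps
        stepi.append(i)                    # restarts at 0 on the second run
        lossi.append(3.0 * 0.999 ** (run * 1000 + i) + 0.1 * torch.rand(1).item())

plt.plot(stepi, lossi)    # overlapping: x wraps back to 0 for the second run
# plt.plot(lossi)         # alternative suggested above: runs appear back to back
plt.show()
```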

summary of our final code, conclusion

sampling from the model

google collab (new!!) notebook advertisement
