Building makemore Part 2: MLP

We implement a multilayer perceptron (MLP) character-level language model. In this video we also introduce many basics of machine learning (e.g. model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.).

Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part2_mlp.ipynb
- Colab notebook (new)!!!: https://colab.research.google.com/drive/1YIfmkftLrz6MPTOO9Vwqrop2Q5llHIGK?usp=sharing
- Bengio et al. 2003 MLP language model paper (pdf): https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- my website: https://karpathy.ai
- my twitter:
- (new) Neural Networks: Zero to Hero series Discord channel: https://discord.gg/3zy8kqD9Cp , for people who'd like to chat more and go beyond youtube comments

Useful links:
- PyTorch internals ref http://blog.ezyang.com/2019/05/pytorch-internals/

Exercises:
- E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2
- E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)? (A quick check of (1) is sketched just after this list.)
- E03: Read the Bengio et al. 2003 paper (link above), implement and try any idea from the paper. Did it work?
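
For E02 part (1), a quick check, assuming the 27-character vocabulary used throughout makemore (26 lowercase letters plus the '.' boundary token):

```python
import torch

# loss if the model assigned a uniform probability of 1/27 to every character
uniform_loss = -torch.log(torch.tensor(1 / 27))
print(uniform_loss)  # tensor(3.2958)
```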

Chapters:
00:00:00 intro
00:01:48 Bengio et al. 2003 (MLP language model) paper walkthrough
00:09:03 (re-)building our training dataset
00:12:19 implementing the embedding lookup table
00:18:35 implementing the hidden layer + internals of torch.Tensor: storage, views
00:29:15 implementing the output layer
00:29:53 implementing the negative log likelihood loss
00:32:17 summary of the full network
00:32:49 introducing F.cross_entropy and why
00:37:56 implementing the training loop, overfitting one batch
00:41:25 training on the full dataset, minibatches
00:45:40 finding a good initial learning rate
00:53:20 splitting up the dataset into train/val/test splits and why
01:00:49 experiment: larger hidden layer
01:05:27 visualizing the character embeddings
01:07:16 experiment: larger embedding size
01:11:46 summary of our final code, conclusion
01:13:24 sampling from the model
01:14:55 Google Colab (new!!) notebook advertisement

#deep learning #neural network #multilayer perceptron #nlp #language model

Timetable:

00:00:27 @akshatsingh6036: [<1809.89it/s] Last Loss: 2.403459072113037, Best Loss: 1.4457638263702393 at epoch 25480

00:01:34 @JuanManuelBerros: PS. At this point I was just uber curious about his previous searches, so I googled them:

00:03:25 @14types: Why is the space small? Even in two-dimensional space you can place an infinite number of points.

00:17:28 @nanuqcz: Every time I think I finally understand what's happening, he does something like this: 😅

00:20:25 @JavArButt: -dimensional vertically scrollable space to describe the functions of PyTorch ()

00:21:24 @JohnDoe-ph6vb: At this point I think it's supposed to be first letter, not first word. It's first word in the paper but first letter in the example.

00:21:39 @dericortiz2713: At this point, when he says words, does he mean the 3-character sequence that was made by block size? And, so, when he refers to the picture behind him, does he mean each of those three blocks represents an index in the block_size array?

00:23:47 @sam.rodriguez: what about just `emb_reshaped = emb.reshape((emb.shape[0], emb.shape[1]*emb.shape[2]))`?
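
For what it's worth, a small check (shapes as in the video: 32 examples, block_size 3, 2-dimensional embeddings) suggesting that `reshape`, `view`, and the explicit `torch.cat(torch.unbind(...))` route all produce the same flattened tensor:

```python
import torch

emb = torch.randn(32, 3, 2)  # stand-in for the embedding tensor in the video

a = emb.view(emb.shape[0], -1)                              # no copy; needs contiguous storage
b = emb.reshape(emb.shape[0], emb.shape[1] * emb.shape[2])  # copies only when it has to
c = torch.cat(torch.unbind(emb, 1), 1)                      # explicit concat; always copies

print(a.shape, torch.equal(a, b), torch.equal(a, c))  # torch.Size([32, 6]) True True
```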

00:24:50 @bloody_albatross: Of course! Memory itself is a one-dimensional "tensor". :D

00:25:36 @rezathr8968: for the PyTorch internals video (@)

00:25:40 @pedroaugustoribeirogomes7999: Please create the "entire video about the internals of pytorch" that you mentioned here. And thank you so much for the content, Andrej!!

00:27:24 @atac8538: At this minute mark at the moment and gotta say, pytorch is amazing. So wonderful how easy they make it for devs with those small tricks.

00:27:27 @JuanManuelBerros: Matthew 27:27-31 > Then the governor’s soldiers took Jesus into the Praetorium and gathered the whole company of soldiers around him. They stripped him and put a scarlet robe on him, and then twisted together a crown of thorns and set it on his head. They put a staff in his right hand. Then they knelt in front of him and mocked him. “Hail, king of the Jews!” they said. They spit on him, and took the staff and struck him on the head again and again. After they had mocked him, they took off the robe and put his own clothes on him. Then they led him away to crucify him.

00:27:27 @JuanManuelBerros: Proverbs 27:27 > You will have plenty of goats’ milk to feed your family and to nourish your female servants.

00:29:20 @louiswang538: we can also use torch.reshape() to get the right shape for W. However, there is a difference between torch.view and torch.reshape. TL;DR: if you just want to reshape tensors, use torch.reshape. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use torch.view.
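
To illustrate that TL;DR, a small sketch of the case where the two differ: view refuses to work on non-contiguous storage, while reshape silently falls back to copying:

```python
import torch

x = torch.arange(6).view(2, 3)
t = x.t()                     # transpose: same underlying storage, but non-contiguous
print(t.is_contiguous())      # False
# t.view(6) would raise a RuntimeError here, because view never copies;
# reshape copies the data whenever a view is impossible
print(t.reshape(6))           # tensor([0, 3, 1, 4, 2, 5])
```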

00:30:03 @BrutalStrike2: What's tanh?

00:31:47 @Pro-ish: "ideally all of these numbers here of course are one because then we are correctly predicting the next character" hmmmmmm, it's reasonable to say these numbers are high, but not one. If the probability here is one, that will exclude any chance of other characters having similar context.

00:34:05 @mconio: re: using the cross_entropy function around this point, it sounds like pytorch takes the derivative of each step of exponentiation then normalization instead of simplifying them before taking the derivative. Is that a "soft" limitation of the implementation, in that a procedure could be defined to overcome it, or is there a bit of mathematical intuition needed to understand how to rewrite the function to produce a simpler derivative?
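
Whatever the answer on the derivative side, the forward-pass motivation is easy to reproduce; a small sketch of the numerical-stability difference between the manual softmax route and the fused F.cross_entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-5.0, 0.0, 100.0]])
target = torch.tensor([2])

# manual route from the video: exponentiate, normalize, take -log of the target prob
counts = logits.exp()                     # exp(100) overflows to inf in float32
probs = counts / counts.sum(1, keepdim=True)
print(-probs[0, target].log())            # tensor([nan])

# the fused version subtracts the max logit internally, so it stays finite
print(F.cross_entropy(logits, target))    # ~tensor(0.)
```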

00:37:00 @LukaszWiklendt: Since probs are invariant to an offset applied to logits, it's fun to plot the drift in the mean or sum of b2. Looks like Brownian motion.
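
That invariance is easy to verify with a tiny sketch (arbitrary logits, any shared offset):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 27)
shifted = logits + 3.7          # add the same constant to every logit in a row

p1 = F.softmax(logits, dim=1)
p2 = F.softmax(shifted, dim=1)
print(torch.allclose(p1, p2))   # True: the probabilities only see differences between logits
```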

00:37:14 @AbhishekVaid: Who would tell you this when you are reading from a book? Exceptional teaching ability.

00:38:00 @BuFu1O1: pfeeeewwww 😳

00:41:30 @thejessundar6370: I don't understand the mini-batching happening at this point. When using ix = torch.randint(0, X.shape[0], (32,)) and using this to index into X, you are just picking 32 data examples from X, not batching all of the data, right? I thought by batching, you take a batch of data, do a forward pass on all items in the batch, take the mean output, do backprop on that mean result and update the model on that loss. Here I feel like Andrej is just selecting 32 individual data examples. Please do correct me if I'm wrong! I'm new to ML!
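
As far as I can tell that reading is right: each step computes the loss and gradient on just the 32 sampled rows, a noisy but much cheaper estimate of the full-dataset gradient. A minimal self-contained sketch (random stand-in tensors with the video's shapes, not the real names dataset):

```python
import torch
import torch.nn.functional as F

# stand-in data and parameters with the video's shapes
X = torch.randint(0, 27, (200000, 3))    # 3-character contexts
Y = torch.randint(0, 27, (200000,))      # next-character targets
C  = torch.randn(27, 2)
W1 = torch.randn(6, 100);  b1 = torch.randn(100)
W2 = torch.randn(100, 27); b2 = torch.randn(27)

# one optimization step only ever touches 32 randomly chosen examples
ix = torch.randint(0, X.shape[0], (32,))      # 32 random row indices
emb = C[X[ix]]                                # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)     # (32, 100)
logits = h @ W2 + b2                          # (32, 27)
loss = F.cross_entropy(logits, Y[ix])         # mean loss over just these 32 rows
print(loss)
```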

00:44:25 @GaurangPatel1: life lesson: much better to have an approximate gradient and take many steps than to have an exact gradient and take a few steps

00:45:00 @LambrosPetrou: Awesome videos, thank you for that! I have a question though about "finding a good initial learning rate", which is either a mistake in the video or I misunderstood something.

00:45:34 @leiyang2176: It seems it is slightly different from the approach presented here. Looking at this part, it looks like for each iteration we randomly select a minibatch of size 32 from the whole training set, update the parameters, then go on to the next iteration.

00:45:40 @myao8930: At 'Finding a good initial learning rate', each learning rate is used just one time. The adjustment of the parameters for one learning rate is based on the parameters already adjusted using the prior, smaller learning rates. I feel that each of the 1,000 learning-rate candidates should go through the same number of iterations; then the losses at the end of the iterations are compared. Please tell me if I am wrong. Thanks!
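
That description of the sweep's mechanics matches the code: the candidate rates are tried one per step, in increasing order, on an evolving set of parameters. A toy reproduction of the idea (a quadratic stand-in loss, not the makemore network):

```python
import torch

torch.manual_seed(42)
w = torch.randn(10, requires_grad=True)   # toy parameters
lre = torch.linspace(-3, 0, 1000)         # exponents of the candidate rates
lrs = 10 ** lre                           # learning rates from 0.001 up to 1.0

lri, lossi = [], []
for i in range(1000):
    loss = (w ** 2).mean()                # toy loss standing in for cross_entropy
    w.grad = None
    loss.backward()
    with torch.no_grad():
        w += -lrs[i] * w.grad             # each candidate rate is used for exactly one step
    lri.append(lre[i].item())
    lossi.append(loss.item())
# plotting lossi against lri gives the curve Andrej reads the 0.1 off of; as the
# comments below note, later steps also benefit from all of the earlier updates.
```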

00:45:40 @YYoung1025: I don't quite understand the part about finding a good initial learning rate. Why does the lowest point of the loss value indicate the best learning rate? It takes some time for the loss value to decrease, right?

00:45:45 @gleb.timofeev: At this point I was waiting for Karpathy's constant to appear. Thank you for the lecture, Andrej.

00:48:45 @manu_221b: At this point Andrej says that the learning rate would be low in the beginning and high at the end. Why was it set like that? My intuition is that the learning rate should be in the opposite order.

00:49:22 @koenBotermans: I believe that at this point the losses and the learning rates are misaligned. The first loss (derived from completely random weights) is computed before the first learning rate is used, and therefore the first learning rate should be aligned with the second loss. You can simply solve this problem with this snippet: `lri = lri[:-1]; lossi = lossi[1:]`

00:50:00 @datou666: Question about this part: in the plot, the y axis is the loss and the x axis is the learning rate, but the x axis is also the step number. How do you know whether the y-axis change is because of the learning-rate difference or the step-number increase?

00:50:30 @JayPinho: Great video! One question, @AndrejKarpathy: around this point you show how to graph an optimal learning rate and ultimately you determine that the 0.1 you started with was pretty good. However, unless I'm misunderstanding your code, aren't you iterating over the 1000 different learning-rate candidates while *simultaneously* doing 1000 consecutive passes over the neural net? Meaning, the loss will naturally be lower during later iterations since you've already done a bunch of backward passes, so the biggest loss improvements would always be stacked towards the beginning of the 1000 iterations, right? Won't that bias your optimal learning rate calculation towards the first few candidates?

00:50:42 @przemysawbuczkowski8715: Can anyone explain to me why, looking at the loss plotted against the exponent of the learning rate, the conclusion is that lr < 0.1 "is way too low"? For me, that's where the loss is actually getting lower, isn't it?

00:53:20 @rmajdodin: To break the data into training, development and test sets, one can also use torch.tensor_split: `n1 = int(0.8 * X.shape[0]); n2 = int(0.9 * X.shape[0]); Xtr, Xdev, Xts = X.tensor_split((n1, n2), dim=0); Ytr, Ydev, Yts = Y.tensor_split((n1, n2), dim=0)`
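
For contrast with that example-level split, my recollection is that the video splits at the word level after a shuffle, so all examples derived from one name land in the same split; a rough sketch with a stand-in word list:

```python
import random

# stand-in list; in the notebook this is the full names dataset
words = ["emma", "olivia", "ava", "isabella", "sophia", "charlotte", "mia", "amelia"]

random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

train_words, dev_words, test_words = words[:n1], words[n1:n2], words[n2:]
print(len(train_words), len(dev_words), len(test_words))  # roughly 80% / 10% / 10%
```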

00:56:17 @afai264: I'm confused about why care must be taken with how many times you can use the test dataset, as the model will learn from it. Is this because there is no equivalent of 'torch.no_grad()' for LLMs - will the LLM always update the weights when given data?

00:59:01 @yukselkapan9996: Thank you for the lectures! This part made me chuckle.

00:59:15 @armanchaudhary1832: It can take days!! How can someone sleep with such pressure?

01:02:15 @3rdman99: I also just noticed, he explicitly mentions these fluctuations at this point. Doh!

01:05:00 @alois-h: Around this point: the reason why we're not "overfitting" with the larger number of params might be the context size. With a context of 3, no number of params will remove the inherent uncertainty.

01:06:56 @siddhantverma532: Fascinating how the vowels end up clustered together!

01:07:20 @HuifengOu-b5v: It should be 10-dimensional embeddings for each *character*, not word, in this character-level language model.

01:10:09 @danieljaszczyszczykoeczews2616: You shouldn't have plotted the stepi variable against the loss :D It could have worked if you'd plotted just plt.plot(loss_history), or applied two different colours to those two runs.

01:10:30 @suyashkumar1990: The plot of the steps and losses after running the training loop multiple times (~ this point, https://youtu.be/TCH_1BHY58I?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=4233) would be wrong because the stepi array keeps appending the same indices [0, 50000). I expect the graph to just get more and more unstable.

01:14:56 @404logicfound: Andrej is learning YouTube tricks 😅
