
intro: why you should care & fun history

@ "it was barely a programming language"

starter code

Love your lectures, they are crystal clear. From , I only find the notation dlogprobs (et similia) a bit misleading, since it denotes the derivative of the loss with respect to the tensor logprobs. I would use something more verbose like dloss_dlogprobs. However, I understand you did it for consistency with torch.

exercise 1: backpropping the atomic compute graph

) for the full batch, whereas in your answer in the video at it's of size (32, 27) only. Can you please clear up this confusion for me, Andrej? I think there's some fundamental flaw in my understanding 😭😭 Is it because in the end we are calling .backward() on a scalar value? 😭

Are logprobs and logits the same? at

At time , if probs are very close to 1, that doesn't mean the network is predicting the next character correctly. Only if a prob is close to 1 and its corresponding gradient from dlogprobs is non-zero does that mean the network got the prediction right.

andrej fard

explained on

Is there a disadvantage to using (logits == logit_maxes).float() to pick out the maximum indices at ?

That's so cute. 😆

Sprinkling Andrej magic throughout the video - had me cracking up at

At , Low Budget Production LOL

At around , instead of differentiating the explicit expanded form of a matrix multiplication and then realizing that the result is again some matrix multiplication, you can actually show more generally that the backprop operation of a linear transformation is always the Hermitian adjoint of that transformation. For matrix multiplication the Hermitian adjoint is just given by multiplication with the transposed matrix. This is especially useful for more complicated transformations like convolutions; just imagine doing these calculations on the completely written-out expression of a convolution. This also explains the duality between summation and replication mentioned at
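The adjoint view is easy to check numerically with autograd. A small sketch of my own (not from the video): for Y = X @ W, backprop applies the transposed maps to the upstream gradient.

```python
import torch

# For a linear map Y = X @ W, the backward pass is the adjoint (transpose) map:
#   dX = dY @ W.T   and   dW = X.T @ dY
torch.manual_seed(0)
X = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 5, requires_grad=True)
Y = X @ W
dY = torch.randn_like(Y)   # an arbitrary upstream gradient
Y.backward(dY)

dX_manual = dY @ W.T       # adjoint of right-multiplication by W
dW_manual = X.T @ dY       # adjoint of left-multiplication by X

print(torch.allclose(X.grad, dX_manual))  # True
print(torch.allclose(W.grad, dW_manual))  # True
```

The same reasoning applies to any linear op (summation, replication, convolution) without ever expanding it element by element.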

I arrived at dh just by figuring out the size of the matrix, and then I continued with your video and you just did all the derivatives and I thought... I am so dumb, I should have done that. But then you say "now I tell you a secret, I normally do..." hahahahaha

if you scroll down, Wolfram Alpha provides 1 - x^2 + 2/3x^4 + O(x^5) as series expansion at x=0 of the derivative of tanh(x), which is the same as the series expansion for 1-tanh(x)^2.
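A quick numeric check of my own (not from the video) that the closed form 1 - tanh(x)^2 really is the derivative, and that it matches the series near x = 0:

```python
import math

def dtanh_numeric(x, h=1e-6):
    # central finite difference approximation of d/dx tanh(x)
    return (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

def dtanh_analytic(x):
    return 1 - math.tanh(x) ** 2

# worst disagreement over a few sample points
checks = [abs(dtanh_numeric(x) - dtanh_analytic(x)) for x in (-1.5, -0.3, 0.0, 0.7, 2.0)]
print(max(checks))  # tiny, limited only by finite-difference error

# the truncated series 1 - x^2 + (2/3) x^4 is close for small x
x = 0.2
print(abs(dtanh_analytic(x) - (1 - x**2 + (2/3) * x**4)))  # ~1e-5
```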

brief digression: bessel’s correction in batchnorm

your attention to detail here on the variance of arrays is out of this world

The reason for using the biased variance during training and the unbiased one during inference (the running var estimate) is that during training, within one mini-batch, we don't care about the complete dataset; the mini-batch is all we are working on at that moment, and in the code you likewise use that moment's mean and var to run batchnorm. But during inference we need the mean and variance of the complete data, and that is what Bessel's correction is for. If we had access to the complete data we wouldn't need Bessel's correction, since we'd have the full data. But since we are using a small sample (a mini-batch) to estimate the variance of the complete data, we do need it. If we used a direct variance calculation over the full data instead of this running var, we could skip Bessel's correction entirely.
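A toy simulation (my own sketch, plain Python) of the point above: averaging many mini-batch variances computed with 1/n underestimates the full-data variance by a factor of about (n-1)/n, while the Bessel-corrected 1/(n-1) estimate is unbiased.

```python
import random

random.seed(0)
population = [random.gauss(0, 1) for _ in range(10_000)]

def var(xs, ddof):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

pop_var = var(population, ddof=0)  # the "complete data" variance

n, trials = 5, 5_000               # tiny mini-batches, many trials
biased = unbiased = 0.0
for _ in range(trials):
    batch = random.sample(population, n)
    biased += var(batch, ddof=0) / trials    # 1/n
    unbiased += var(batch, ddof=1) / trials  # 1/(n-1), Bessel-corrected

print(round(biased / pop_var, 2), round(unbiased / pop_var, 2))  # ~0.8 vs ~1.0
```

With n = 5, the biased estimate comes out around (n-1)/n = 0.8 of the true variance, which is exactly the bias the running-var estimate needs to avoid.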

best part

since the adjoint of summation is replication (and vice versa).

I've noticed that although dbnvar/(n-1) # (1, 64) doesn't have the same size as the bndiff2 term (32, 64), it still works fine during the backprop, because the (1, 64) vector broadcasts well over (32, 64). And such a solution is more optimal from the perspective of storage and calculation.

Around , a simpler approach might be to just directly multiply like this: dbndiff2 = 1/(n-1) * dbnvar
At around dbnmeani should probably have keepdim=True, since otherwise you're removing the row dimension making it of shape [64], while bnmeani was originally [1, 64]. But I guess it still magically works because of broadcasting in the backprop and in the cmp :)
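A tiny demo of my own of why the missing keepdim is harmless here: a tensor of shape [64] broadcasts against [32, 64] exactly the same way a [1, 64] tensor does, so the gradient lands in the right places either way.

```python
import torch

torch.manual_seed(0)
g_no_keep = torch.arange(64.0)      # shape [64]  (keepdim dropped)
g_keep = g_no_keep.unsqueeze(0)     # shape [1, 64] (keepdim=True)
x = torch.randn(32, 64)

# both broadcast to [32, 64] and give identical results
print(torch.allclose(x * g_no_keep, x * g_keep))  # True
print((x * g_no_keep).shape)                      # torch.Size([32, 64])
```

Relying on this is fragile, though: a [64] tensor broadcast against [32, 1] would expand along the wrong axis, which is why keeping keepdim=True is the safer habit.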

Such a great video for really understanding the detail under the hood! And lol at the momentary disappointment at just before realizing the calculation wasn't complete yet 😂

`/n * dbnmeani` during . Makes no difference mathematically, but there's nothing like finding oopsies in your code :P

I believe the loop implementing the final derivative at can be vectorized if you just rewrite the selection operation as a matrix operation, then do a matmul derivative like done elsewhere in the video:
Thank you! Also, my implementation for dC at : dC = torch.zeros_like(C); dC[Xb] += demb
One-liner for dC (): dC = (F.one_hot(Xb, num_classes=C.shape[0]).unsqueeze(-1) * demb.unsqueeze(2)).sum((0, 1))

I was able to accumulate the dC without a "for" loop using this code: dC = torch.zeros_like(C); dC.index_add_(0, Xb.flatten(), demb.flatten(0, 1))
I struggled through everything to make sure I found answers before seeing the video solution; the one-line solution I got for dC was: dC = F.one_hot(Xb.view(-1), num_classes=C.shape[0]).float().T @ demb.view(-1, C.shape[-1]). Don't ask me for an intuitive explanation, I just fiddled until something worked... (sort of inspired by how earlier in the series you showed that C[Xb] is just a more efficient version of using F.one_hot with matrix multiplication). Also, for whatever reason I can't get dhpreact to be exact, only approximate, even using your exact same code to calculate it? So I just ended up doing dhpreact = hpreact.grad # (1.0 - h**2) * dh to make sure its effect didn't cascade further down the gradient calculations. Any idea why this would be the case?

I managed to come up with a *vectorized* solution and it's just one line of code! dC = F.one_hot(Xb.reshape(-1), num_classes=27).float().T @ demb.reshape((-1, n_emb))

Here is the better implementation of the code:

Optimised dC calculation() instead of the for loop

To eliminate the for loop at time , I found this after a little searching. Very little experience with pytorch, so take it with a grain of salt: dembflat = demb.view(-1, 10); Xbflat = Xb.view(-1); dC = torch.zeros_like(C); dC.index_add_(0, Xbflat, dembflat)

P.S.: dC can be done with dC.index_add_(0, Xb.view(-1), demb.view(-1, 10)) ;)

: To backprop through the embedding matrix C, I used the following quick code, which does not need a for loop:
can be vectorized using: dC = dC.index_add_(0, Xb.view(-1), demb.view(-1, C.shape[1]))

So great videos, thank you so much! I tried to simplify dC (at in the video), but failed after some time, so asked chatgpt, and here is the shiny simple result:
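Since several comments above propose different vectorizations of dC, here is a self-contained sketch of mine checking that the explicit loop, index_add_, and the one-hot matmul all agree (the toy shapes, batch 32, block size 3, vocab 27, n_emb 10, are my assumptions). One caveat: the plain dC[Xb] += demb variant does not, as far as I can tell, accumulate repeated indices in PyTorch, which is exactly what index_add_ handles.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n_emb = 27, 10
C = torch.randn(vocab, n_emb)
Xb = torch.randint(0, vocab, (32, 3))   # repeated indices are likely here
demb = torch.randn(32, 3, n_emb)

# 1) the explicit loop from the video
dC_loop = torch.zeros_like(C)
for i in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        dC_loop[Xb[i, j]] += demb[i, j]

# 2) index_add_ on flattened indices (accumulates duplicates correctly)
dC_ia = torch.zeros_like(C)
dC_ia.index_add_(0, Xb.view(-1), demb.view(-1, n_emb))

# 3) one-hot matrix multiplication
dC_oh = F.one_hot(Xb.view(-1), num_classes=vocab).float().T @ demb.view(-1, n_emb)

print(torch.allclose(dC_loop, dC_ia), torch.allclose(dC_loop, dC_oh))  # True True
```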

exercise 2: cross entropy loss backward pass

at (just under the separation line for i≠y vs i=y)? I understand from the line above that we are looking for the derivative of e^ly / Σe^lj. So, when we consider the denominator we would get e^ly * -(Σe^lj)^-2, but the solution multiplies it by e^li, which I don't quite get. Cheers!

The calculus at is way too complicated. Start with -log(e^l_y / Σ_j e^l_j) = -l_y + log(Σ_j e^l_j) before you differentiate. Then d(-l_y)/dl_i = -1 if i=y (else 0) of course, and d log(Σ_j e^l_j)/dl_i = e^l_i / Σ_j e^l_j = softmax(l)_i, and you're done: dloss/dl_i = softmax(l)_i - 1{i=y}.

At (exercise 2, near the end, while deriving dlogits for i != y): why did you substitute 0 for e**lj ?
At about - I think the gradient that you calculate is for norm_logits and not for logits. It looks like they are approximately equal by chance. I think this is the correct implementation: dnorm_logits = F.softmax(norm_logits, 1); dnorm_logits[range(n), Yb] -= 1; dnorm_logits /= n; dlogit_maxes = -dnorm_logits.sum(1, keepdim=True); dlogits = dnorm_logits.clone(); dlogits[range(n), logits.max(1).indices] += dlogit_maxes.view(-1)
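A quick way to convince yourself of the softmax-minus-one-hot formula is to check it against autograd. My own sketch (the 32×27 shapes are assumptions), for mean cross-entropy over a batch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, vocab = 32, 27
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

# reference gradient from autograd
loss = F.cross_entropy(logits, Yb)  # mean reduction over the batch
loss.backward()

# manual gradient: dlogits = (softmax(logits) - one_hot(Yb)) / n
dlogits = F.softmax(logits.detach(), 1)
dlogits[range(n), Yb] -= 1
dlogits /= n

print(torch.allclose(logits.grad, dlogits, atol=1e-6))  # True
```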

I'm really confused about the calculations at (the lower part of the paper about `if i <> j` etc). It says "product rule, power rule, ...". How do I use the product rule to take the derivative of softmax? PS: I asked ChatGPT and it explained to me that I need to use the quotient rule :)

Question: Why is this () true not only for dlogits, but also for dW2, db2, db1, and not true for dW1?

He really made me realize something at , it kicked in 🔥

exercise 3: batch norm layer backward pass

In the WHOA :) part, should there be a "-" in front of the 2? Although it doesn't really matter, since the final result is 0. But why is that?

Question: At , you conclude in the last derivation step that d sigma^2 / d x_i = 2/(m-1) * (x_i - mu). This would be correct if mu were just a constant, but in fact mu is also a function of x_i, with d mu / d x_i = 1/m. So how does this cancel out so that you still end up with your simple expression?
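To the question above: mu's dependence on x_i does enter the chain rule, but its contribution multiplies the sum of deviations, which is identically zero. Sketching the missing step (my notation):

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{m-1}\sum_{j=1}^{m}(x_j-\mu)^2,
\qquad \mu=\frac{1}{m}\sum_{k=1}^{m}x_k,
\qquad \frac{\partial\mu}{\partial x_i}=\frac{1}{m},\\[4pt]
\frac{\partial\sigma^2}{\partial x_i}
&= \frac{2}{m-1}\sum_{j=1}^{m}(x_j-\mu)\Big(\delta_{ij}-\frac{1}{m}\Big)
= \frac{2}{m-1}\Big[(x_i-\mu)-\frac{1}{m}\underbrace{\sum_{j=1}^{m}(x_j-\mu)}_{=\,0}\Big]
= \frac{2}{m-1}\,(x_i-\mu).
\end{aligned}
```

So the simple expression holds exactly, not just under a constant-mu approximation.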

At , the camera starts giving up... so do I... 🤣

I'm not totally sure that calculating the derivative using the "bnraw" variable is a good solution, since bnraw is computed in the later steps of BN. Thus there's no point in hpreact_fast, as we would have to do all the same arithmetic in parallel just to fetch bnraw. My solution is not the best one, but still:

exercise 4: putting it all together
