
intro: why you should care & fun history

@ "it was barely a programming language"

starter code

Love your lectures, they are crystal clear. From , I only find the notation dlogprobs (et similia) a bit misleading, since it denotes the derivative of the loss with respect to the tensor logprobs. I would use something more verbose like dloss_dlogprobs. However, I understand you did it for consistency with torch.

exercise 1: backpropping the atomic compute graph

) for the full batch, whereas in your answer in the video at it's of size (32, 27) only. Can you please clear up this confusion for me, Andrej? I think there's some fundamental flaw in my understanding 😭😭 Is it because in the end we are calling .backward() on a scalar value? 😭

Are logprobs and logits the same? at

At time , if probs are very close to 1, that doesn't mean the network is predicting the next character correctly. Only if a prob is close to 1 and its corresponding gradient from dlogprobs is non-zero does that mean the network got the prediction right.

andrej fard

explained on

Is there a disadvantage to using (logits == logit_maxes).float() to pick out the maximum indices at ?

That's so cute. 😆

Sprinkling Andrej magic throughout the video - had me cracking up at

At , Low Budget Production LOL

At around , instead of differentiating the explicit expanded form of a matrix multiplication and then realizing that the result is again some matrix multiplication, you can actually show more generally that the backprop operation of a linear transformation is always the Hermitian adjoint of that transformation. For matrix multiplication the Hermitian adjoint is just given by multiplication with the transposed matrix. This is especially useful for more complicated transformations like convolutions; just imagine doing these calculations on the completely written-out expression of a convolution. This also explains the duality between summation and replication mentioned at
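The adjoint view is easy to check numerically with autograd. A small sketch of my own (not from the video): for Y = X @ W, backprop applies the transposed maps to the upstream gradient.

```python
import torch

# For a linear map Y = X @ W, the backward pass is the adjoint (transpose) map:
#   dX = dY @ W.T   and   dW = X.T @ dY
torch.manual_seed(0)
X = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 5, requires_grad=True)
Y = X @ W
dY = torch.randn_like(Y)   # an arbitrary upstream gradient
Y.backward(dY)

dX_manual = dY @ W.T       # adjoint of right-multiplication by W
dW_manual = X.T @ dY       # adjoint of left-multiplication by X

print(torch.allclose(X.grad, dX_manual))  # True
print(torch.allclose(W.grad, dW_manual))  # True
```

The same reasoning applies to any linear op (summation, replication, convolution) without ever expanding it element by element.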

I arrived at dh just by figuring out the size of the matrix, and then I continued with your video and you just did all the derivatives and I thought... I am so dumb, I should have done that. But then you say "now I tell you a secret, I normally do..." hahahahaha

if you scroll down, Wolfram Alpha provides 1 - x^2 + 2/3x^4 + O(x^5) as series expansion at x=0 of the derivative of tanh(x), which is the same as the series expansion for 1-tanh(x)^2.
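A quick numeric check of my own (not from the video) that the closed form 1 - tanh(x)^2 really is the derivative, and that it matches the series near x = 0:

```python
import math

def dtanh_numeric(x, h=1e-6):
    # central finite difference approximation of d/dx tanh(x)
    return (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

def dtanh_analytic(x):
    return 1 - math.tanh(x) ** 2

# worst disagreement over a few sample points
checks = [abs(dtanh_numeric(x) - dtanh_analytic(x)) for x in (-1.5, -0.3, 0.0, 0.7, 2.0)]
print(max(checks))  # tiny, limited only by finite-difference error

# the truncated series 1 - x^2 + (2/3) x^4 is close for small x
x = 0.2
print(abs(dtanh_analytic(x) - (1 - x**2 + (2/3) * x**4)))  # ~1e-5
```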

brief digression: bessel’s correction in batchnorm

your attention to detail here on the variance of arrays is out of this world

The reason for using the biased variance during training and the unbiased one during inference (the running var estimate) is that during training, within one mini-batch, we don't care about the complete dataset; the mini-batch is all we are working on at that moment, and in the code you likewise use that moment's mean and var to run batchnorm. But during inference we need the mean and variance of the complete data, and that is what Bessel's correction is for. If we had access to the complete data we wouldn't need Bessel's correction, since we'd have the full data. But since we are using a small sample (a mini-batch) to estimate the variance of the complete data, we do need it. If we used a direct variance calculation over the full data instead of this running var, we could skip Bessel's correction entirely.
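A toy simulation (my own sketch, plain Python) of the point above: averaging many mini-batch variances computed with 1/n underestimates the full-data variance by a factor of about (n-1)/n, while the Bessel-corrected 1/(n-1) estimate is unbiased.

```python
import random

random.seed(0)
population = [random.gauss(0, 1) for _ in range(10_000)]

def var(xs, ddof):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

pop_var = var(population, ddof=0)  # the "complete data" variance

n, trials = 5, 5_000               # tiny mini-batches, many trials
biased = unbiased = 0.0
for _ in range(trials):
    batch = random.sample(population, n)
    biased += var(batch, ddof=0) / trials    # 1/n
    unbiased += var(batch, ddof=1) / trials  # 1/(n-1), Bessel-corrected

print(round(biased / pop_var, 2), round(unbiased / pop_var, 2))  # ~0.8 vs ~1.0
```

With n = 5, the biased estimate comes out around (n-1)/n = 0.8 of the true variance, which is exactly the bias the running-var estimate needs to avoid.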

best part

since the adjoint of summation is replication (and vice versa).

I've noticed that although dbnvar/(n-1) # (1, 64) doesn't have the same size as the bndiff2 term (32, 64), it still works fine during the backprop, because the (1, 64) vector broadcasts well over (32, 64). And such a solution is more optimal from the perspective of storage and calculation.

Around , a simpler approach might be to just directly multiply like this: dbndiff2 = 1/(n-1) * dbnvar
At around dbnmeani should probably have keepdim=True, since otherwise you're removing the row dimension making it of shape [64], while bnmeani was originally [1, 64]. But I guess it still magically works because of broadcasting in the backprop and in the cmp :)
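A tiny demo of my own of why the missing keepdim is harmless here: a tensor of shape [64] broadcasts against [32, 64] exactly the same way a [1, 64] tensor does, so the gradient lands in the right places either way.

```python
import torch

torch.manual_seed(0)
g_no_keep = torch.arange(64.0)      # shape [64]  (keepdim dropped)
g_keep = g_no_keep.unsqueeze(0)     # shape [1, 64] (keepdim=True)
x = torch.randn(32, 64)

# both broadcast to [32, 64] and give identical results
print(torch.allclose(x * g_no_keep, x * g_keep))  # True
print((x * g_no_keep).shape)                      # torch.Size([32, 64])
```

Relying on this is fragile, though: a [64] tensor broadcast against [32, 1] would expand along the wrong axis, which is why keeping keepdim=True is the safer habit.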

Such a great video for really understanding the detail under the hood! And lol at the momentary disappointment at just before realizing the calculation wasn't complete yet 😂

`/n * dbnmeani` during . Makes no difference mathematically, but there's nothing like finding oopsies in your code :P

I believe the loop implementing the final derivative at can be vectorized if you just rewrite the selection operation as a matrix operation, then do a matmul derivative like done elsewhere in the video:
Thank you! Also, my implementation for dC at : dC = torch.zeros_like(C); dC[Xb] += demb
One-liner for dC (): dC = (F.one_hot(Xb, num_classes=C.shape[0]).unsqueeze(-1) * demb.unsqueeze(2)).sum((0, 1))

I was able to accumulate the dC without a "for" loop using this code: dC = torch.zeros_like(C); dC.index_add_(0, Xb.flatten(), demb.flatten(0, 1))
I struggled through everything to make sure I found answers before seeing the video solution; the one-line solution I got for dC was: dC = F.one_hot(Xb.view(-1), num_classes=C.shape[0]).float().T @ demb.view(-1, C.shape[-1]). Don't ask me for an intuitive explanation, I just fiddled until something worked... (sort of inspired by how earlier in the series you showed that C[Xb] is just a more efficient version of using F.one_hot with matrix multiplication). Also, for whatever reason I can't get dhpreact to be exact, only approximate, even using your exact same code to calculate it? So I just ended up doing dhpreact = hpreact.grad # (1.0 - h**2) * dh to make sure its effect didn't cascade further down the gradient calculations. Any idea why this would be the case?

I managed to come up with a *vectorized* solution and it's just one line of code! dC = F.one_hot(Xb.reshape(-1), num_classes=27).float().T @ demb.reshape((-1, n_emb))

Here is the better implementation of the code:

Optimised dC calculation() instead of the for loop

To eliminate the for loop at time , I found this after a little searching. Very little experience with pytorch, so take it with a grain of salt: dembflat = demb.view(-1, 10); Xbflat = Xb.view(-1); dC = torch.zeros_like(C); dC.index_add_(0, Xbflat, dembflat)

P.S.: dC can be done with dC.index_add_(0, Xb.view(-1), demb.view(-1, 10)) ;)

: To backprop through the embedding matrix C, I used the following quick code, which does not need a for loop:
can be vectorized using: dC = dC.index_add_(0, Xb.view(-1), demb.view(-1, C.shape[1]))

So great videos, thank you so much! I tried to simplify dC (at in the video), but failed after some time, so asked chatgpt, and here is the shiny simple result:
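Since several comments above propose different vectorizations of dC, here is a self-contained sketch of mine checking that the explicit loop, index_add_, and the one-hot matmul all agree (the toy shapes, batch 32, block size 3, vocab 27, n_emb 10, are my assumptions). One caveat: the plain dC[Xb] += demb variant does not, as far as I can tell, accumulate repeated indices in PyTorch, which is exactly what index_add_ handles.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n_emb = 27, 10
C = torch.randn(vocab, n_emb)
Xb = torch.randint(0, vocab, (32, 3))   # repeated indices are likely here
demb = torch.randn(32, 3, n_emb)

# 1) the explicit loop from the video
dC_loop = torch.zeros_like(C)
for i in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        dC_loop[Xb[i, j]] += demb[i, j]

# 2) index_add_ on flattened indices (accumulates duplicates correctly)
dC_ia = torch.zeros_like(C)
dC_ia.index_add_(0, Xb.view(-1), demb.view(-1, n_emb))

# 3) one-hot matrix multiplication
dC_oh = F.one_hot(Xb.view(-1), num_classes=vocab).float().T @ demb.view(-1, n_emb)

print(torch.allclose(dC_loop, dC_ia), torch.allclose(dC_loop, dC_oh))  # True True
```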

exercise 2: cross entropy loss backward pass

at (just under the separation line for i≠y vs i=y)? I understand from the line above that we are looking for the derivative of e^ly / Σe^lj. So, when we consider the denominator we would get e^ly * -(Σe^lj)^-2, but the solution multiplies it by e^li, which I don't quite get. Cheers!

The calculus at is way too complicated. Start with -log(e^l_y / Σ_j e^l_j) = -l_y + log(Σ_j e^l_j) before you differentiate. Then d(-l_y)/dl_i = -1 if i=y (else 0) of course, and d log(Σ_j e^l_j)/dl_i = e^l_i / Σ_j e^l_j = softmax(l)_i, and you're done: dloss/dl_i = softmax(l)_i - 1{i=y}.

At (exercise 2, near the end, while deriving dlogits for i != y): why did you substitute 0 for e**lj ?
At about - I think the gradient that you calculate is for norm_logits and not for logits. It looks like they are approximately equal by chance. I think this is the correct implementation: dnorm_logits = F.softmax(norm_logits, 1); dnorm_logits[range(n), Yb] -= 1; dnorm_logits /= n; dlogit_maxes = -dnorm_logits.sum(1, keepdim=True); dlogits = dnorm_logits.clone(); dlogits[range(n), logits.max(1).indices] += dlogit_maxes.view(-1)
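A quick way to convince yourself of the softmax-minus-one-hot formula is to check it against autograd. My own sketch (the 32×27 shapes are assumptions), for mean cross-entropy over a batch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, vocab = 32, 27
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

# reference gradient from autograd
loss = F.cross_entropy(logits, Yb)  # mean reduction over the batch
loss.backward()

# manual gradient: dlogits = (softmax(logits) - one_hot(Yb)) / n
dlogits = F.softmax(logits.detach(), 1)
dlogits[range(n), Yb] -= 1
dlogits /= n

print(torch.allclose(logits.grad, dlogits, atol=1e-6))  # True
```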

I'm really confused about the calculations at (the lower part of the paper about `if i <> j` etc). It says "product rule, power rule, ...". How do I use the product rule to take the derivative of softmax? PS: I asked ChatGPT and it explained to me that I need to use the quotient rule :)

Question: Why is this () true not only for dlogits, but also for dW2, db2, db1, and not true for dW1?

He really made me realize something at , it kicked in 🔥

exercise 3: batch norm layer backward pass

In the WHOA :) part, should there be a "-" in front of the 2? Although it doesn't really matter, since the final result is 0. But why is that?

Question: At , you conclude in the last derivation step that d sigma^2 / d x_i = 2/(m-1) * (x_i - mu). This would be correct if mu were just a constant, but in fact mu is also a function of x_i, with d mu / d x_i = 1/m. So how does this cancel out so that you still end up with your simple expression?
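To the question above: mu's dependence on x_i does enter the chain rule, but its contribution multiplies the sum of deviations, which is identically zero. Sketching the missing step (my notation):

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{m-1}\sum_{j=1}^{m}(x_j-\mu)^2,
\qquad \mu=\frac{1}{m}\sum_{k=1}^{m}x_k,
\qquad \frac{\partial\mu}{\partial x_i}=\frac{1}{m},\\[4pt]
\frac{\partial\sigma^2}{\partial x_i}
&= \frac{2}{m-1}\sum_{j=1}^{m}(x_j-\mu)\Big(\delta_{ij}-\frac{1}{m}\Big)
= \frac{2}{m-1}\Big[(x_i-\mu)-\frac{1}{m}\underbrace{\sum_{j=1}^{m}(x_j-\mu)}_{=\,0}\Big]
= \frac{2}{m-1}\,(x_i-\mu).
\end{aligned}
```

So the simple expression holds exactly, not just under a constant-mu approximation.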

At , the camera starts giving up... so do I... 🤣

I'm not totally sure that calculating the derivative using the "bnraw" variable is a good solution, since bnraw is computed in the later steps of BN. Thus there's no point in hpreact_fast, as we would have to do all the same arithmetic in parallel just to fetch bnraw. My solution is not the best one, but still:

exercise 4: putting it all together
