Building makemore Part 3: Activations & Gradients, BatchNorm

We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and introduce the first modern innovation that made doing so much easier: Batch Normalization. Residual connections and the Adam optimizer remain notable todos for a later video.

Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part3_bn.ipynb
- Colab notebook: https://colab.research.google.com/drive/1H5CSy-OnisagUgDUXhHwo1ng2pjKHYSN?usp=sharing
- my website: https://karpathy.ai
- my twitter:
- Discord channel: https://discord.gg/3zy8kqD9Cp

Useful links:
- "Kaiming init" paper: https://arxiv.org/abs/1502.01852
- BatchNorm paper: https://arxiv.org/abs/1502.03167
- Bengio et al. 2003 MLP language model paper (pdf): https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- Good paper illustrating some of the problems with batchnorm in practice: https://arxiv.org/abs/2105.07576

Exercises:
- E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
- E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc. has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm, as sketched below. Verify that this gives the same forward pass during inference. i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! pretty cool.
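A minimal sketch of the folding idea in E02, using torch.nn modules rather than the notebook's hand-rolled layers (layer sizes and the perturbed running statistics are arbitrary, just to make the check non-trivial):

import torch
import torch.nn as nn

def fold_batchnorm(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # At inference, BatchNorm is y = gamma * (x - running_mean) / sqrt(running_var + eps) + beta.
    # With x = W a + b this is still an affine map, so it can be absorbed into new W2, b2.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # gamma / sqrt(var + eps)
    W2 = linear.weight * scale[:, None]                              # rescale each output row
    b = linear.bias if linear.bias is not None else torch.zeros_like(bn.running_mean)
    b2 = (b - bn.running_mean) * scale + bn.bias
    folded = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        folded.weight.copy_(W2)
        folded.bias.copy_(b2)
    return folded

lin, bn = nn.Linear(10, 20), nn.BatchNorm1d(20)
with torch.no_grad():                                                # pretend these were learned
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5);    bn.bias.uniform_(-1, 1)
bn.eval()                                                            # use running stats, not batch stats

x = torch.randn(32, 10)
print(torch.allclose(bn(lin(x)), fold_batchnorm(lin, bn)(x), atol=1e-5))   # True: same forward pass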

Chapters:
00:00:00 intro
00:01:22 starter code
00:04:19 fixing the initial loss
00:12:59 fixing the saturated tanh
00:27:53 calculating the init scale: “Kaiming init”
00:40:40 batch normalization
01:03:07 batch normalization: summary
01:04:50 real example: resnet50 walkthrough
01:14:10 summary of the lecture
01:18:35 just kidding: part2: PyTorch-ifying the code
01:26:51 viz #1: forward pass activations statistics
01:30:54 viz #2: backward pass gradient statistics
01:32:07 the fully linear case of no non-linearities
01:36:15 viz #3: parameter activation and gradient statistics
01:39:55 viz #4: update:data ratio over time
01:46:04 bringing back batchnorm, looking at the visualizations
01:51:34 summary of the lecture for real this time

#neural network #deep learning #makemore #batchnorm #batch normalization #pytorch
Timestamped notes & viewer comments:

- 00:00:00 @zlsj861: 1. Implementing and refactoring neural networks for language modeling (00:00:00 - 00:03:21)
- 00:00:30 @zlsj861: Continuing the makemore implementation with a multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
- 00:01:03 @zlsj861: Understanding neural net activations and gradients in training is crucial for optimizing architectures.
- 00:02:06 @zlsj861: Refactored the code to optimize a neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
- 00:03:22 @zlsj861: 2. Efficiency of torch.no_grad and neural net initialization issues (00:03:22 - 00:14:22)
- 00:03:28 @zlsj861: Using the torch.no_grad decorator to prevent gradient computation.
- 00:04:00 @zlsj861: Using torch.no_grad makes computation more efficient by eliminating gradient tracking.
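The notes above refer to PyTorch's torch.no_grad used as a decorator; a tiny illustration (the model and eval_loss names are mine, standing in for the lecture's evaluation helper):

import torch

model = torch.nn.Linear(10, 1)                 # stand-in for the MLP
X, Y = torch.randn(100, 10), torch.randn(100, 1)

@torch.no_grad()                               # no autograd graph is built inside this function
def eval_loss():
    return torch.nn.functional.mse_loss(model(X), Y)

loss = eval_loss()
print(loss.requires_grad)                      # False: cheaper in time and memory, and .backward() is impossible here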
- 00:04:22 @zlsj861: Network initialization causes a high loss of 27, which rapidly decreases to 1 or 2.
- 00:04:25 @wolpumba4099: Initial loss: a high initial loss (e.g., 27) indicates improper network initialization. Softmax logits should be close to zero at initialization to produce a uniform probability distribution and the expected loss; this avoids confident mispredictions and the "hockey stick" loss curve.
- 00:05:00 @zlsj861: At initialization, the model aims for a uniform distribution among the 27 characters, with roughly 1/27 probability for each.
- 00:06:06 @kaushik333ify: Hi Andrej, thank you for the amazing set of lectures, which elucidate multiple aspects of training an ML model. In the video you mention that at the beginning of training we expect the NN to output all equal probabilities, i.e. 1/27, which implies that all logits should be close to 0. Using this logic you arrive at the fact that the weight matrices should be initialized close to 0. How does one think about this for regression problems like autoencoders? What would a "good" starting output be? Is it still all zeros?
- 00:06:19 @zlsj861: The neural net creates skewed probability distributions, leading to high loss.
- 00:07:06 @mihaidanila5584: It's a bit subtle why it's called a loss, because it's not immediately apparent with respect to what it is a loss. It seems it's the loss resulting from choosing the character having index i given the probability distribution stored in the tensor.
- 00:09:28 @wolpumba4099: Scaling down the weights of the output layer can achieve this.
- 00:12:36 @zlsj861: Loss at initialization is as expected; improved to 2.12-2.16.
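A minimal sketch of the two facts in this chapter: the loss we expect from a uniform prediction over 27 characters, and shrinking the output layer's weights so the initial logits are near zero (variable names are illustrative, not the lecture's exact notebook code):

import torch

vocab_size = 27
print(-torch.log(torch.tensor(1.0 / vocab_size)))         # ~3.29: the expected initial loss, far below ~27

# make the last layer "less confident" at init: small weights, zero bias -> logits near zero
n_hidden = 200
W2 = torch.randn(n_hidden, vocab_size) * 0.01
b2 = torch.zeros(vocab_size)

h = torch.randn(32, n_hidden)                              # stand-in for hidden activations
logits = h @ W2 + b2
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, vocab_size, (32,)))
print(loss)                                                # close to 3.29 instead of a huge number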
- 00:13:09 @wolpumba4099: Saturated activations: tanh activations clustered around -1 and 1 indicate saturation, hindering gradient flow. Saturated neurons update less frequently and impede training.
- 00:14:24 @zlsj861: 3. Neural network initialization (00:14:24 - 00:36:39)
- 00:15:10 @tianwang: the night and day shift
- 00:15:13 @NailAllayarov: not only sweating but also losing hair :)
- 00:15:14 @aayushsmarten: Andrej's transformation between 00:15:14 and 00:15:16 was pretty quick 😉
- 00:15:15 @manastripathi7559: so, no one is going to talk about how Andrej grew a decade younger 🤔
- 00:16:03 @zlsj861: The chain rule with the local gradient is affected when outputs of tanh are close to -1 or 1, leading to a halt in backpropagation.
- 00:18:38 @zlsj861: Concern over destructive gradients in the flat regions of the h outputs, tackled by analyzing absolute values.
- 00:19:19 @wolpumba4099: This can lead to dead neurons, which never activate and don't learn.
- 00:24:59 @wolpumba4099: Scaling down the weights of the hidden layer can help prevent saturation.
- 00:25:43 @rmajdodin: Great video! A quick question: why is a U shape better than a cup shape for the histogram of h? Don't we want h to have some normal distribution, like hpreact?
- 00:26:03 @zlsj861: Optimization led to an improved validation loss from 2.17 to 2.10 by fixing the softmax and tanh layer issues.
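The kind of saturation check discussed in this chapter, as a small sketch (the 0.99 threshold and the layer sizes are illustrative choices, not prescribed values):

import torch

torch.manual_seed(42)
x = torch.randn(32, 30)                        # batch of concatenated embeddings (stand-in)
W1_big   = torch.randn(30, 200) * 1.0          # too-large init: pre-activations have large scale
W1_small = torch.randn(30, 200) * 0.1          # scaled-down init

for name, W1 in [("big", W1_big), ("small", W1_small)]:
    h = torch.tanh(x @ W1)
    saturated = (h.abs() > 0.99).float().mean().item()   # fraction of activations in the flat tails
    print(f"{name:5s} init: saturated={saturated:.2%}, std={h.std().item():.2f}")

# heavily saturated tanh units have near-zero local gradient (1 - h^2), so they barely learn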
- 00:27:58 @wolpumba4099: Kaiming initialization: a principled approach to weight scaling, aiming for unit gaussian activations throughout the network.
- 00:30:02 @zlsj861: The standard deviation expanded to three; we aim for a unit gaussian distribution in neural nets.
- 00:30:17 @zlsj861: Scaling down by 0.2 shrinks the gaussian to a standard deviation of 0.6.
- 00:31:46 @wolpumba4099: Calculates the standard deviation based on fan-in and a gain factor specific to the non-linearity used.
- 00:31:46 @zlsj861: Initializing neural network weights for well-behaved activations (Kaiming He et al.).
- 00:33:56 @wolpumba4099: PyTorch offers torch.nn.init.kaiming_normal_ for this.
- 00:36:00 @mratanusarkar: Modern innovations make things stable and mean we don't have to be super detailed and careful with gradient and backprop issues. (self-note)
- 00:36:39 @zlsj861: 4. Neural net initialization and batch normalization (00:36:39 - 00:51:52)
- 00:36:55 @zlsj861: Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
- 00:37:05 @zlsj861: Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
- 00:38:07 @shivamrawat3897: The standard deviation relation used to rescale the initial weights: will this only work in the case that the input data also has variance approximately 1?
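The notes above describe the rule std = gain / sqrt(fan_in); a small sketch comparing the manual scaling with PyTorch's built-in initializer, using the tanh gain of 5/3 as in the lecture (the specific fan_in/fan_out values are arbitrary):

import torch

fan_in, fan_out = 30, 200
gain = 5.0 / 3.0                                   # torch.nn.init.calculate_gain('tanh')

# manual Kaiming-style scaling, in the lecture's (fan_in, fan_out) layout
W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5
print(W.std().item())                              # ~ gain / sqrt(fan_in) ~ 0.30

# the built-in initializer; note it expects the (fan_out, fan_in) layout of nn.Linear.weight
W2 = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(W2, nonlinearity='tanh')
print(W2.std().item())                             # same scale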
- 00:40:49 @wolpumba4099: Batch Normalization. Concept: normalizes activations within each batch to be roughly unit gaussian; controls the activation scale, stabilizing training and mitigating the need for precise weight initialization.
- 00:40:51 @zlsj861: Batch normalization, from 2015, enabled reliable training of deep neural nets.
- 00:42:09 @zlsj861: Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.
- 00:42:17 @wolpumba4099: Implementation:
- 00:42:41 @wolpumba4099: Normalizes activations by subtracting the batch mean and dividing by the batch standard deviation.
- 00:43:04 @zlsj861: Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.
- 00:43:50 @zlsj861: Calculating the standard deviation of the activations; the mean is the average value of a neuron's activation.
- 00:44:30 @TheAIEpiphany: std should be a centralized moment (i.e. subtract the mean first) according to the paper, although I see that the PyTorch implementation is the same as yours.
- 00:45:20 @chuanjiang6931: Wouldn't adding scale and shift revert the previous normalization? Improper scale and shift parameters will still cause saturated activations.
- 00:45:30 @knat3489: Just to be clear, normalizing the pre-activation neurons to have 0 mean and 1 std does not make them Gaussian distributed. The sum is only Gaussian distributed at initialization, because we have initialized the weights to be normally distributed.
- 00:45:54 @wolpumba4099: Learnable gain and bias parameters allow the network to adjust the normalized distribution.
- 00:46:16 @zlsj861: Backpropagation guides the movement of the distribution; a scale and shift are added for the final output.
- 00:50:20 @wolpumba4099: Couples examples within a batch, leading to potential bugs and inconsistencies.
- 00:51:50 @ITSimplifiedinHINDI: Can anyone explain what he said from 00:51:50 to 00:53:00?
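A condensed sketch of the training-time batchnorm computation described above; the bngain/bnbias/hpreact names follow the lecture's style, but this is a paraphrase rather than the exact notebook code, and the epsilon placement is simplified (the paper adds it to the variance):

import torch

n_hidden = 200
bngain = torch.ones(1, n_hidden)               # learnable scale (gamma)
bnbias = torch.zeros(1, n_hidden)              # learnable shift (beta)

hpreact = torch.randn(32, n_hidden) * 3 + 1    # poorly scaled pre-activations (stand-in)

# normalize each hidden unit over the batch dimension, then re-scale and re-shift
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hpreact_bn = bngain * (hpreact - bnmeani) / (bnstdi + 1e-5) + bnbias

h = torch.tanh(hpreact_bn)
print(hpreact_bn.mean().item(), hpreact_bn.std().item())   # ~0 and ~1 before gain/bias move them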
- 00:51:52 @zlsj861: 5. Jittering and batch normalization in neural network training (00:51:52 - 01:01:35)
- 00:51:55 @wolpumba4099: Offers a regularization effect due to coupling examples within a batch.
- 00:52:37 @zlsj861: Padding (jittering) input examples adds entropy, augments the data, and regularizes neural nets.
- 00:54:03 @wolpumba4099: Requires careful handling at inference time due to the batch dependency.
- 00:54:09 @zlsj861: Batch normalization effectively controls activations and their distributions.
- 00:54:38 @wolpumba4099: Running mean and variance are tracked during training and used for inference. Caveats:
- 00:56:33 @zlsj861: The batch normalization paper introduces running mean and standard deviation estimation during training.
- 00:58:07 @narenbabu629: bnmean_running = (0.999 * bnmean_running) + (0.001 * bnmeani): why are you multiplying 0.999 with bnmean_running and 0.001 with bnmeani? Why does bnmean_running = bnmean_running + bnmeani not work?
- 00:58:55 @scriptblue: [The running-mean update] is basically an Infinite Impulse Response (IIR) filter.
- 00:59:00 @phen8318: Can anyone please tell why we take the numbers 0.999 and 0.001 specifically? I am new to neural networks and all of this is a bit overwhelming. Thanks.
- 01:01:10 @zlsj861: Eliminated the explicit calibration stage; almost done with batch normalization; epsilon prevents division by zero.
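The 0.999/0.001 questions above concern an exponential moving average with momentum 0.001 (the low-pass/IIR filter @scriptblue mentions). A tiny synthetic sketch of why the coefficients sum to 1 (true_mean and the noise level are made up, just to show convergence):

import torch

momentum = 0.001                   # small momentum: slow, stable estimate of the dataset-wide mean
bnmean_running = torch.zeros(1)

true_mean = torch.tensor([0.7])
for step in range(20000):
    bnmeani = true_mean + 0.1 * torch.randn(1)   # noisy per-batch mean
    # keep 99.9% of the old estimate, mix in 0.1% of the new batch statistic;
    # because 0.999 + 0.001 = 1, the estimate stays on the scale of a mean,
    # whereas plain summation (old + new) would grow without bound.
    bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani

print(bnmean_running)              # ~0.7, close to the true mean, with no gradient-based training needed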
- 01:01:36 @zlsj861: 6. Batch normalization and ResNet in PyTorch (01:01:36 - 01:09:21)
- 01:01:37 @wolpumba4099: Makes bias terms in the preceding layers redundant.
- 01:02:13 @kamikaze9271: I can't understand why removing the mean removes the effect of adding a bias. Why would the grad be zero?
- 01:02:30 @zlsj861: Biases are subtracted out in batch normalization, reducing their impact to zero.
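For the bias question above: with a batch of pre-activations the batch mean contains the same bias, so centering cancels it. In my own notation (not the lecture's):

x_i = W a_i + b, \qquad \mu = \frac{1}{m}\sum_{i=1}^{m}(W a_i + b) = W\bar{a} + b, \qquad x_i - \mu = W(a_i - \bar{a})

The centered value no longer depends on b, so the loss is flat in b and its gradient through this path is zero; the BatchNorm shift beta plays that role instead.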
- 01:03:53 @zlsj861: Using batch normalization to control activations in the neural net, with gain, bias, mean, and standard deviation parameters.
- 01:04:35 @wolpumba4099: Would it help at the end of training to optimize with bnmean_running and bnstd_running to normalize the preactivations hpreact? Maybe at that point regularization isn't necessary anymore, and the rest of the weights can be optimized for the particular batch norm calibration that will be used during inference.
- 01:07:06 @Splish_Splash: I would add that ReLU is much easier to compute (max of 2 values, and the derivative is either 0 or 1) than tanh, where we have exponents.
- 01:07:53 @zlsj861: Creating deep neural networks with weight layers, normalization, and non-linearities, as exemplified in the provided code.
- 01:08:52 @wolpumba4099: Default PyTorch initialization schemes and parameters are discussed.
- 01:09:21 @zlsj861: 7. PyTorch weight initialization and batch normalization (01:09:21 - 01:23:37)
- 01:10:05 @zlsj861: PyTorch initializes weights using 1/sqrt(fan-in) from a uniform distribution.
- 01:10:23 @memex9953: Great video, I loved it. Just a question: for the Linear layer in PyTorch he says that the uniform distribution is used to initialize the weights, but then, when PyTorch-ifying the code, his implementation of the Linear layer uses the normal distribution. Did I miss something, or did he make a "mistake"?
- 01:11:03 @BenEng: (quoting the video) "The reason they're doing this is, if you have a roughly Gaussian input, this will ensure that out of this layer you will have a roughly Gaussian output, and you basically achieve that by scaling the weights."
- 01:11:11 @zlsj861: Scaling the weights by 1 over sqrt(fan-in); using a batch normalization layer in PyTorch with 200 features.
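On the uniform-vs-normal question above: nn.Linear's default draws weights from a uniform distribution bounded by roughly 1/sqrt(fan_in), while the lecture's custom Linear uses a normal with a comparable scale. A quick empirical check, assuming a recent PyTorch version:

import math
import torch

fan_in = 200
lin = torch.nn.Linear(fan_in, 100)
bound = 1 / math.sqrt(fan_in)
print(lin.weight.min().item(), lin.weight.max().item())   # within about +/- 0.0707 = 1/sqrt(200)
print(lin.weight.std().item(), bound / math.sqrt(3))      # uniform(-b, b) has std b/sqrt(3)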
- 01:14:35 @zlsj861: The importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
- 01:16:30 @zlsj861: Batch normalization centers the data for gaussian activations in deep neural networks.
- 01:17:32 @zlsj861: Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
- 01:18:30 @JavArButt: He says 'Bye', but looking at the time it seems too early. Most people don't want lectures to be long, but I'm happy this one didn't end there.
- 01:18:35 @aron_g.: always gets me
- 01:18:36 @fabianandresvagnoni5057: The "Okay, so I lied" moment was too relatable xD
- 01:18:40 @wolpumba4099: PyTorch-ifying the code:
- 01:18:59 @yukuanlu6676: I don't understand the part where the layers are organized by putting a tanh after each linear layer while the initialization of the linear layer is `self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5`. I think it's not Kaiming initialization, because the gain for tanh is `5/3`, but in the code it's set to `1`.
- 01:19:13 @wolpumba4099: Diagnostic tools:
- 01:19:26 @wolpumba4099: The code is restructured using torch.nn.Module subclasses for linear, batch normalization, and tanh layers. This modular approach aligns with PyTorch's structure and allows easy stacking of layers.
- 01:23:39 @zlsj861: 8. Custom PyTorch layers and network analysis (01:23:39 - 01:55:56)
- 01:24:32 @zlsj861: Updating buffers using an exponential moving average with the torch.no_grad context manager.
- 01:25:23 @MagicBoterham: Why is the last layer made "less confident like we saw", and where did we see this?
- 01:25:47 @zlsj861: The model has 46,000 parameters and uses PyTorch for the forward and backward passes, with visualizations of the forward pass activations.
- 01:26:25 @apivovarov2: I'd use emb.flatten(1, 2) instead of emb.view(emb.shape[0], -1) to combine the two last dimensions into one. It feels better to avoid the shape lookup emb.shape[0].
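A compressed sketch of the layer-object style described above, abbreviated relative to the lecture's classes (no BatchNorm1d class, no generator seeding, sizes arbitrary); activations are kept on self.out so they can be inspected by the visualization cells:

import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn(fan_in, fan_out) / fan_in**0.5   # scaled init
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)      # retained for the activation-statistics plots
        return self.out
    def parameters(self):
        return []

layers = [Linear(30, 200), Tanh(), Linear(200, 27)]
x = torch.randn(32, 30)
for layer in layers:
    x = layer(x)
print(x.shape)                        # torch.Size([32, 27])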
- 01:26:53 @wolpumba4099: Forward pass activations should exhibit a stable distribution across layers, indicating proper scaling.
- 01:26:53 @wolpumba4099: Visualization of statistics: histograms of activations, gradients, weights, and update:data ratios reveal potential issues during training.
- 01:28:04 @zlsj861: Saturation is around 20% initially, then stabilizes at 5%, with a standard deviation of 0.65, due to the gain set at 5/3.
- 01:28:58 @mehul4mak: Can anyone please explain the part at 01:28:58?
- 01:30:00 @TheOtroManolo: I think I missed why some saturation (around 5%) is better than no saturation at all. Didn't saturation impede further training? Perhaps he just meant that 5% is low enough, and that's the best we can do if we want to avoid the deeper activations converging to zero?
- 01:30:10 @leopetrini: The 5/3 gain for tanh comes from the average value of tanh^2(x), where x is distributed as a Gaussian.
- 01:30:28 @ovidiuc4: I'm at 01:30:28, so I haven't finished yet, but something is unclear: what's the point of stacking these layers instead of having just one Linear and one Tanh? Since tanh squashes and afterwards we're diffusing, it seems like we're doing accordion-like work unnecessarily. What is the benefit we're getting?
- 01:30:36 @obnoxiaaeristokles3872: 5/3 = 1.66... is pretty close to the golden ratio 1.61803. Coincidence?
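A back-of-the-envelope check of the gain remark above (my own estimate, not from the video): measuring how much tanh shrinks a unit gaussian and inverting that factor lands in the same ballpark as PyTorch's 5/3.

import torch

torch.manual_seed(0)
z = torch.randn(10_000_000)
shrink = torch.tanh(z).std().item()     # tanh squashes a unit gaussian to std ~0.63
print(shrink, 1.0 / shrink)             # ~0.63 and ~1.59, close to (but not exactly) 5/3 ~ 1.67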
- 01:30:57 @wolpumba4099: Backward pass gradients should be similar across layers, signifying balanced gradient flow.
- 01:33:19 @zlsj861: Setting the gain correctly at 1 prevents shrinking and diffusion in batch normalization.
- 01:33:30 @rmajdodin: The reason the gradients of the higher layers have a bigger deviation (in the absence of the tanh layers) is that you can write the whole NN as a sum of products, and it is easy to see that each weight of layer 0 appears in 1 term, of layer 1 in 30 terms, of layer 2 in 3000 terms, and so on. Therefore a small change of a weight in the higher layers changes the output more.
- 01:35:48 @emid6811: Does anyone know the paper about "analyzing infinitely linear layers" that Andrej mentioned in the video?
- 01:35:59 @navalgupta5807: One doubt regarding the condition p.dim==2: I don't understand why this was done and which parameters it will filter out.
- 01:36:20 @wolpumba4099: Parameter weights: the distribution and scale should be monitored for anomalies and asymmetries.
- 01:38:41 @zlsj861: The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
- 01:38:45 @Koyaanisqatsi2000: "That's problematic because in that simple stochastic gradient setup you would be training this last layer 10x faster with respect to the other layers." Why 10x faster?
- 01:39:56 @wolpumba4099: The update:data ratio should be around -3 on a log scale, indicating a good learning rate and balanced parameter updates.
- 01:40:25 @TheAIEpiphany: Did you try using the log L2 norm ratio here instead of std? You're using variance as a proxy for how big the updates are w.r.t. the data values.
- 01:40:38 @vks43523: Can someone explain why we divide the std of the gradient by the std of the data instead of using the mean? The weight update ratio = grad*learning_rate/weight_value. As we have multiple inputs and multiple entries in the batch, we could take the mean to calculate a single value; I cannot figure out how std is a better option.
- 01:40:46 @zxynj: Why do we use the standard deviation to calculate the update-to-data ratio?
- 01:40:49 @jonathanhampton6055: Why stddev here? Wouldn't we want to use something like the L1-norm? Also, wouldn't we want to log this metric before updating the parameters?
- 01:43:42 @zlsj861: Monitoring the update ratio for the parameters to ensure efficient training, aiming for -3 on the log plot.
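The update:data ratio being debated above, as a minimal sketch for a single parameter tensor (the toy loss and sizes are mine; -3 on the log10 scale means updates about 1/1000 the size of the data):

import torch

lr = 0.1
p = torch.randn(200, 27, requires_grad=True)        # stand-in for one parameter tensor
loss = (p ** 2).mean()
loss.backward()

with torch.no_grad():
    update = -lr * p.grad
    ratio = (update.std() / p.data.std()).log10()    # the quantity tracked per step; aim for roughly -3
    p += update
print(ratio.item())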
- 01:52:04 @zlsj861: Introduced batch normalization and PyTorch modules for neural networks.
- 01:53:06 @zlsj861: Introduction to diagnostic tools for neural network analysis.
- 01:55:50 @zlsj861: Introduction to the diagnostic tools in neural networks; initialization and backpropagation remain areas of active research and ongoing progress.
