Building makemore Part 3: Activations & Gradients, BatchNorm

We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and introduce the first modern innovation that made doing so much easier: Batch Normalization. Residual connections and the Adam optimizer remain notable todos for a later video.

Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part3_bn.ipynb
- Colab notebook: https://colab.research.google.com/drive/1H5CSy-OnisagUgDUXhHwo1ng2pjKHYSN?usp=sharing
- my website: https://karpathy.ai
- my twitter:
- Discord channel: https://discord.gg/3zy8kqD9Cp

Useful links:
- "Kaiming init" paper: https://arxiv.org/abs/1502.01852
- BatchNorm paper: https://arxiv.org/abs/1502.03167
- Bengio et al. 2003 MLP language model paper (pdf): https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- Good paper illustrating some of the problems with batchnorm in practice: https://arxiv.org/abs/2105.07576

Exercises:
- E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
- E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc. has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm, as sketched below. Verify that this gives the same forward pass during inference. i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! pretty cool.
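A minimal sketch of the folding idea in E02, using torch.nn modules rather than the notebook's hand-rolled layers (layer sizes and the perturbed running statistics are arbitrary, just to make the check non-trivial):

import torch
import torch.nn as nn

def fold_batchnorm(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # At inference, BatchNorm is y = gamma * (x - running_mean) / sqrt(running_var + eps) + beta.
    # With x = W a + b this is still an affine map, so it can be absorbed into new W2, b2.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # gamma / sqrt(var + eps)
    W2 = linear.weight * scale[:, None]                              # rescale each output row
    b = linear.bias if linear.bias is not None else torch.zeros_like(bn.running_mean)
    b2 = (b - bn.running_mean) * scale + bn.bias
    folded = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        folded.weight.copy_(W2)
        folded.bias.copy_(b2)
    return folded

lin, bn = nn.Linear(10, 20), nn.BatchNorm1d(20)
with torch.no_grad():                                                # pretend these were learned
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5);    bn.bias.uniform_(-1, 1)
bn.eval()                                                            # use running stats, not batch stats

x = torch.randn(32, 10)
print(torch.allclose(bn(lin(x)), fold_batchnorm(lin, bn)(x), atol=1e-5))   # True: same forward pass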

Chapters:
00:00:00 intro
00:01:22 starter code
00:04:19 fixing the initial loss
00:12:59 fixing the saturated tanh
00:27:53 calculating the init scale: “Kaiming init”
00:40:40 batch normalization
01:03:07 batch normalization: summary
01:04:50 real example: resnet50 walkthrough
01:14:10 summary of the lecture
01:18:35 just kidding: part2: PyTorch-ifying the code
01:26:51 viz #1: forward pass activations statistics
01:30:54 viz #2: backward pass gradient statistics
01:32:07 the fully linear case of no non-linearities
01:36:15 viz #3: parameter activation and gradient statistics
01:39:55 viz #4: update:data ratio over time
01:46:04 bringing back batchnorm, looking at the visualizations
01:51:34 summary of the lecture for real this time

#neural network #deep learning #makemore #batchnorm #batch normalization #pytorch
Timestamped notes & viewer comments:

- 00:00:00 @zlsj861: 1. Implementing and refactoring neural networks for language modeling (00:00:00 - 00:03:21)
- 00:00:30 @zlsj861: Continuing the makemore implementation with a multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
- 00:01:03 @zlsj861: Understanding neural net activations and gradients in training is crucial for optimizing architectures.
- 00:02:06 @zlsj861: Refactored the code to optimize a neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
- 00:03:22 @zlsj861: 2. Efficiency of torch.no_grad and neural net initialization issues (00:03:22 - 00:14:22)
- 00:03:28 @zlsj861: Using the torch.no_grad decorator to prevent gradient computation.
- 00:04:00 @zlsj861: Using torch.no_grad makes computation more efficient by eliminating gradient tracking.
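The notes above refer to PyTorch's torch.no_grad used as a decorator; a tiny illustration (the model and eval_loss names are mine, standing in for the lecture's evaluation helper):

import torch

model = torch.nn.Linear(10, 1)                 # stand-in for the MLP
X, Y = torch.randn(100, 10), torch.randn(100, 1)

@torch.no_grad()                               # no autograd graph is built inside this function
def eval_loss():
    return torch.nn.functional.mse_loss(model(X), Y)

loss = eval_loss()
print(loss.requires_grad)                      # False: cheaper in time and memory, and .backward() is impossible here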
- 00:04:22 @zlsj861: Network initialization causes a high loss of 27, which rapidly decreases to 1 or 2.
- 00:04:25 @wolpumba4099: Initial loss: a high initial loss (e.g., 27) indicates improper network initialization. Softmax logits should be close to zero at initialization to produce a uniform probability distribution and the expected loss; this avoids confident mispredictions and the "hockey stick" loss curve.
- 00:05:00 @zlsj861: At initialization, the model aims for a uniform distribution among the 27 characters, with roughly 1/27 probability for each.
- 00:06:06 @kaushik333ify: Hi Andrej, thank you for the amazing set of lectures, which elucidate multiple aspects of training an ML model. In the video you mention that at the beginning of training we expect the NN to output all equal probabilities, i.e. 1/27, which implies that all logits should be close to 0. Using this logic you arrive at the fact that the weight matrices should be initialized close to 0. How does one think about this for regression problems like autoencoders? What would a "good" starting output be? Is it still all zeros?
- 00:06:19 @zlsj861: The neural net creates skewed probability distributions, leading to high loss.
- 00:07:06 @mihaidanila5584: It's a bit subtle why it's called a loss, because it's not immediately apparent with respect to what it is a loss. It seems it's the loss resulting from choosing the character having index i given the probability distribution stored in the tensor.
- 00:09:28 @wolpumba4099: Scaling down the weights of the output layer can achieve this.
- 00:12:36 @zlsj861: Loss at initialization is as expected; improved to 2.12-2.16.
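A minimal sketch of the two facts in this chapter: the loss we expect from a uniform prediction over 27 characters, and shrinking the output layer's weights so the initial logits are near zero (variable names are illustrative, not the lecture's exact notebook code):

import torch

vocab_size = 27
print(-torch.log(torch.tensor(1.0 / vocab_size)))         # ~3.29: the expected initial loss, far below ~27

# make the last layer "less confident" at init: small weights, zero bias -> logits near zero
n_hidden = 200
W2 = torch.randn(n_hidden, vocab_size) * 0.01
b2 = torch.zeros(vocab_size)

h = torch.randn(32, n_hidden)                              # stand-in for hidden activations
logits = h @ W2 + b2
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, vocab_size, (32,)))
print(loss)                                                # close to 3.29 instead of a huge number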
- 00:13:09 @wolpumba4099: Saturated activations: tanh activations clustered around -1 and 1 indicate saturation, hindering gradient flow. Saturated neurons update less frequently and impede training.
- 00:14:24 @zlsj861: 3. Neural network initialization (00:14:24 - 00:36:39)
- 00:15:10 @tianwang: the night and day shift
- 00:15:13 @NailAllayarov: not only sweating but also losing hair :)
- 00:15:14 @aayushsmarten: Andrej's transformation between 00:15:14 and 00:15:16 was pretty quick 😉
- 00:15:15 @manastripathi7559: so, no one is going to talk about how Andrej grew a decade younger 🤔
- 00:16:03 @zlsj861: The chain rule with the local gradient is affected when outputs of tanh are close to -1 or 1, leading to a halt in backpropagation.
- 00:18:38 @zlsj861: Concern over destructive gradients in the flat regions of the h outputs, tackled by analyzing absolute values.
- 00:19:19 @wolpumba4099: This can lead to dead neurons, which never activate and don't learn.
- 00:24:59 @wolpumba4099: Scaling down the weights of the hidden layer can help prevent saturation.
- 00:25:43 @rmajdodin: Great video! A quick question: why is a U shape better than a cup shape for the histogram of h? Don't we want h to have some normal distribution, like hpreact?
- 00:26:03 @zlsj861: Optimization led to an improved validation loss from 2.17 to 2.10 by fixing the softmax and tanh layer issues.
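The kind of saturation check discussed in this chapter, as a small sketch (the 0.99 threshold and the layer sizes are illustrative choices, not prescribed values):

import torch

torch.manual_seed(42)
x = torch.randn(32, 30)                        # batch of concatenated embeddings (stand-in)
W1_big   = torch.randn(30, 200) * 1.0          # too-large init: pre-activations have large scale
W1_small = torch.randn(30, 200) * 0.1          # scaled-down init

for name, W1 in [("big", W1_big), ("small", W1_small)]:
    h = torch.tanh(x @ W1)
    saturated = (h.abs() > 0.99).float().mean().item()   # fraction of activations in the flat tails
    print(f"{name:5s} init: saturated={saturated:.2%}, std={h.std().item():.2f}")

# heavily saturated tanh units have near-zero local gradient (1 - h^2), so they barely learn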
- 00:27:58 @wolpumba4099: Kaiming initialization: a principled approach to weight scaling, aiming for unit gaussian activations throughout the network.
- 00:30:02 @zlsj861: The standard deviation expanded to three; we aim for a unit gaussian distribution in neural nets.
- 00:30:17 @zlsj861: Scaling down by 0.2 shrinks the gaussian to a standard deviation of 0.6.
- 00:31:46 @wolpumba4099: Calculates the standard deviation based on fan-in and a gain factor specific to the non-linearity used.
- 00:31:46 @zlsj861: Initializing neural network weights for well-behaved activations (Kaiming He et al.).
- 00:33:56 @wolpumba4099: PyTorch offers torch.nn.init.kaiming_normal_ for this.
- 00:36:00 @mratanusarkar: Modern innovations make things stable and mean we don't have to be super detailed and careful with gradient and backprop issues. (self-note)
- 00:36:39 @zlsj861: 4. Neural net initialization and batch normalization (00:36:39 - 00:51:52)
- 00:36:55 @zlsj861: Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
- 00:37:05 @zlsj861: Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
- 00:38:07 @shivamrawat3897: The standard deviation relation used to rescale the initial weights: will this only work in the case that the input data also has variance approximately 1?
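The notes above describe the rule std = gain / sqrt(fan_in); a small sketch comparing the manual scaling with PyTorch's built-in initializer, using the tanh gain of 5/3 as in the lecture (the specific fan_in/fan_out values are arbitrary):

import torch

fan_in, fan_out = 30, 200
gain = 5.0 / 3.0                                   # torch.nn.init.calculate_gain('tanh')

# manual Kaiming-style scaling, in the lecture's (fan_in, fan_out) layout
W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5
print(W.std().item())                              # ~ gain / sqrt(fan_in) ~ 0.30

# the built-in initializer; note it expects the (fan_out, fan_in) layout of nn.Linear.weight
W2 = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(W2, nonlinearity='tanh')
print(W2.std().item())                             # same scale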
- 00:40:49 @wolpumba4099: Batch Normalization. Concept: normalizes activations within each batch to be roughly unit gaussian; controls the activation scale, stabilizing training and mitigating the need for precise weight initialization.
- 00:40:51 @zlsj861: Batch normalization, from 2015, enabled reliable training of deep neural nets.
- 00:42:09 @zlsj861: Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.
- 00:42:17 @wolpumba4099: Implementation:
- 00:42:41 @wolpumba4099: Normalizes activations by subtracting the batch mean and dividing by the batch standard deviation.
- 00:43:04 @zlsj861: Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.
- 00:43:50 @zlsj861: Calculating the standard deviation of the activations; the mean is the average value of a neuron's activation.
- 00:44:30 @TheAIEpiphany: std should be a centralized moment (i.e. subtract the mean first) according to the paper, although I see that the PyTorch implementation is the same as yours.
- 00:45:20 @chuanjiang6931: Wouldn't adding scale and shift revert the previous normalization? Improper scale and shift parameters will still cause saturated activations.
- 00:45:30 @knat3489: Just to be clear, normalizing the pre-activation neurons to have 0 mean and 1 std does not make them Gaussian distributed. The sum is only Gaussian distributed at initialization, because we have initialized the weights to be normally distributed.
- 00:45:54 @wolpumba4099: Learnable gain and bias parameters allow the network to adjust the normalized distribution.
- 00:46:16 @zlsj861: Backpropagation guides the movement of the distribution; a scale and shift are added for the final output.
- 00:50:20 @wolpumba4099: Couples examples within a batch, leading to potential bugs and inconsistencies.
- 00:51:50 @ITSimplifiedinHINDI: Can anyone explain what he said from 00:51:50 to 00:53:00?
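A condensed sketch of the training-time batchnorm computation described above; the bngain/bnbias/hpreact names follow the lecture's style, but this is a paraphrase rather than the exact notebook code, and the epsilon placement is simplified (the paper adds it to the variance):

import torch

n_hidden = 200
bngain = torch.ones(1, n_hidden)               # learnable scale (gamma)
bnbias = torch.zeros(1, n_hidden)              # learnable shift (beta)

hpreact = torch.randn(32, n_hidden) * 3 + 1    # poorly scaled pre-activations (stand-in)

# normalize each hidden unit over the batch dimension, then re-scale and re-shift
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hpreact_bn = bngain * (hpreact - bnmeani) / (bnstdi + 1e-5) + bnbias

h = torch.tanh(hpreact_bn)
print(hpreact_bn.mean().item(), hpreact_bn.std().item())   # ~0 and ~1 before gain/bias move them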
- 00:51:52 @zlsj861: 5. Jittering and batch normalization in neural network training (00:51:52 - 01:01:35)
- 00:51:55 @wolpumba4099: Offers a regularization effect due to coupling examples within a batch.
- 00:52:37 @zlsj861: Padding (jittering) input examples adds entropy, augments the data, and regularizes neural nets.
- 00:54:03 @wolpumba4099: Requires careful handling at inference time due to the batch dependency.
- 00:54:09 @zlsj861: Batch normalization effectively controls activations and their distributions.
- 00:54:38 @wolpumba4099: Running mean and variance are tracked during training and used for inference. Caveats:
- 00:56:33 @zlsj861: The batch normalization paper introduces running mean and standard deviation estimation during training.
- 00:58:07 @narenbabu629: bnmean_running = (0.999 * bnmean_running) + (0.001 * bnmeani): why are you multiplying 0.999 with bnmean_running and 0.001 with bnmeani? Why does bnmean_running = bnmean_running + bnmeani not work?
- 00:58:55 @scriptblue: [The running-mean update] is basically an Infinite Impulse Response (IIR) filter.
- 00:59:00 @phen8318: Can anyone please tell why we take the numbers 0.999 and 0.001 specifically? I am new to neural networks and all of this is a bit overwhelming. Thanks.
- 01:01:10 @zlsj861: Eliminated the explicit calibration stage; almost done with batch normalization; epsilon prevents division by zero.
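The 0.999/0.001 questions above concern an exponential moving average with momentum 0.001 (the low-pass/IIR filter @scriptblue mentions). A tiny synthetic sketch of why the coefficients sum to 1 (true_mean and the noise level are made up, just to show convergence):

import torch

momentum = 0.001                   # small momentum: slow, stable estimate of the dataset-wide mean
bnmean_running = torch.zeros(1)

true_mean = torch.tensor([0.7])
for step in range(20000):
    bnmeani = true_mean + 0.1 * torch.randn(1)   # noisy per-batch mean
    # keep 99.9% of the old estimate, mix in 0.1% of the new batch statistic;
    # because 0.999 + 0.001 = 1, the estimate stays on the scale of a mean,
    # whereas plain summation (old + new) would grow without bound.
    bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani

print(bnmean_running)              # ~0.7, close to the true mean, with no gradient-based training needed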
- 01:01:36 @zlsj861: 6. Batch normalization and ResNet in PyTorch (01:01:36 - 01:09:21)
- 01:01:37 @wolpumba4099: Makes bias terms in the preceding layers redundant.
- 01:02:13 @kamikaze9271: I can't understand why removing the mean removes the effect of adding a bias. Why would the grad be zero?
- 01:02:30 @zlsj861: Biases are subtracted out in batch normalization, reducing their impact to zero.
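For the bias question above: with a batch of pre-activations the batch mean contains the same bias, so centering cancels it. In my own notation (not the lecture's):

x_i = W a_i + b, \qquad \mu = \frac{1}{m}\sum_{i=1}^{m}(W a_i + b) = W\bar{a} + b, \qquad x_i - \mu = W(a_i - \bar{a})

The centered value no longer depends on b, so the loss is flat in b and its gradient through this path is zero; the BatchNorm shift beta plays that role instead.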
- 01:03:53 @zlsj861: Using batch normalization to control activations in the neural net, with gain, bias, mean, and standard deviation parameters.
- 01:04:35 @wolpumba4099: Would it help at the end of training to optimize with bnmean_running and bnstd_running to normalize the preactivations hpreact? Maybe at that point regularization isn't necessary anymore, and the rest of the weights can be optimized for the particular batch norm calibration that will be used during inference.
- 01:07:06 @Splish_Splash: I would add that ReLU is much easier to compute (max of 2 values, and the derivative is either 0 or 1) than tanh, where we have exponents.
- 01:07:53 @zlsj861: Creating deep neural networks with weight layers, normalization, and non-linearities, as exemplified in the provided code.
- 01:08:52 @wolpumba4099: Default PyTorch initialization schemes and parameters are discussed.
- 01:09:21 @zlsj861: 7. PyTorch weight initialization and batch normalization (01:09:21 - 01:23:37)
- 01:10:05 @zlsj861: PyTorch initializes weights using 1/sqrt(fan-in) from a uniform distribution.
- 01:10:23 @memex9953: Great video, I loved it. Just a question: for the Linear layer in PyTorch he says that the uniform distribution is used to initialize the weights, but then, when PyTorch-ifying the code, his implementation of the Linear layer uses the normal distribution. Did I miss something, or did he make a "mistake"?
- 01:11:03 @BenEng: (quoting the video) "The reason they're doing this is, if you have a roughly Gaussian input, this will ensure that out of this layer you will have a roughly Gaussian output, and you basically achieve that by scaling the weights."
- 01:11:11 @zlsj861: Scaling the weights by 1 over sqrt(fan-in); using a batch normalization layer in PyTorch with 200 features.
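On the uniform-vs-normal question above: nn.Linear's default draws weights from a uniform distribution bounded by roughly 1/sqrt(fan_in), while the lecture's custom Linear uses a normal with a comparable scale. A quick empirical check, assuming a recent PyTorch version:

import math
import torch

fan_in = 200
lin = torch.nn.Linear(fan_in, 100)
bound = 1 / math.sqrt(fan_in)
print(lin.weight.min().item(), lin.weight.max().item())   # within about +/- 0.0707 = 1/sqrt(200)
print(lin.weight.std().item(), bound / math.sqrt(3))      # uniform(-b, b) has std b/sqrt(3)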
- 01:14:35 @zlsj861: The importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
- 01:16:30 @zlsj861: Batch normalization centers the data for gaussian activations in deep neural networks.
- 01:17:32 @zlsj861: Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
- 01:18:30 @JavArButt: He says 'Bye', but looking at the time it seems too early. Most people don't want lectures to be long, but I'm happy this one didn't end there.
- 01:18:35 @aron_g.: always gets me
- 01:18:36 @fabianandresvagnoni5057: The "Okay, so I lied" moment was too relatable xD
- 01:18:40 @wolpumba4099: PyTorch-ifying the code:
- 01:18:59 @yukuanlu6676: I don't understand the part where the layers are organized by putting a tanh after each linear layer while the initialization of the linear layer is `self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5`. I think it's not Kaiming initialization, because the gain for tanh is `5/3`, but in the code it's set to `1`.
- 01:19:13 @wolpumba4099: Diagnostic tools:
- 01:19:26 @wolpumba4099: The code is restructured using torch.nn.Module subclasses for linear, batch normalization, and tanh layers. This modular approach aligns with PyTorch's structure and allows easy stacking of layers.
- 01:23:39 @zlsj861: 8. Custom PyTorch layers and network analysis (01:23:39 - 01:55:56)
- 01:24:32 @zlsj861: Updating buffers using an exponential moving average with the torch.no_grad context manager.
- 01:25:23 @MagicBoterham: Why is the last layer made "less confident like we saw", and where did we see this?
- 01:25:47 @zlsj861: The model has 46,000 parameters and uses PyTorch for the forward and backward passes, with visualizations of the forward pass activations.
- 01:26:25 @apivovarov2: I'd use emb.flatten(1, 2) instead of emb.view(emb.shape[0], -1) to combine the two last dimensions into one. It feels better to avoid the shape lookup emb.shape[0].
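A compressed sketch of the layer-object style described above, abbreviated relative to the lecture's classes (no BatchNorm1d class, no generator seeding, sizes arbitrary); activations are kept on self.out so they can be inspected by the visualization cells:

import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn(fan_in, fan_out) / fan_in**0.5   # scaled init
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)      # retained for the activation-statistics plots
        return self.out
    def parameters(self):
        return []

layers = [Linear(30, 200), Tanh(), Linear(200, 27)]
x = torch.randn(32, 30)
for layer in layers:
    x = layer(x)
print(x.shape)                        # torch.Size([32, 27])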
- 01:26:53 @wolpumba4099: Forward pass activations should exhibit a stable distribution across layers, indicating proper scaling.
- 01:26:53 @wolpumba4099: Visualization of statistics: histograms of activations, gradients, weights, and update:data ratios reveal potential issues during training.
- 01:28:04 @zlsj861: Saturation is around 20% initially, then stabilizes at 5%, with a standard deviation of 0.65, due to the gain set at 5/3.
- 01:28:58 @mehul4mak: Can anyone please explain the part at 01:28:58?
- 01:30:00 @TheOtroManolo: I think I missed why some saturation (around 5%) is better than no saturation at all. Didn't saturation impede further training? Perhaps he just meant that 5% is low enough, and that's the best we can do if we want to avoid the deeper activations converging to zero?
- 01:30:10 @leopetrini: The 5/3 gain for tanh comes from the average value of tanh^2(x), where x is distributed as a Gaussian.
- 01:30:28 @ovidiuc4: I'm at 01:30:28, so I haven't finished yet, but something is unclear: what's the point of stacking these layers instead of having just one Linear and one Tanh? Since tanh squashes and afterwards we're diffusing, it seems like we're doing accordion-like work unnecessarily. What is the benefit we're getting?
- 01:30:36 @obnoxiaaeristokles3872: 5/3 = 1.66... is pretty close to the golden ratio 1.61803. Coincidence?
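A back-of-the-envelope check of the gain remark above (my own estimate, not from the video): measuring how much tanh shrinks a unit gaussian and inverting that factor lands in the same ballpark as PyTorch's 5/3.

import torch

torch.manual_seed(0)
z = torch.randn(10_000_000)
shrink = torch.tanh(z).std().item()     # tanh squashes a unit gaussian to std ~0.63
print(shrink, 1.0 / shrink)             # ~0.63 and ~1.59, close to (but not exactly) 5/3 ~ 1.67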
- 01:30:57 @wolpumba4099: Backward pass gradients should be similar across layers, signifying balanced gradient flow.
- 01:33:19 @zlsj861: Setting the gain correctly at 1 prevents shrinking and diffusion in batch normalization.
- 01:33:30 @rmajdodin: The reason the gradients of the higher layers have a bigger deviation (in the absence of the tanh layers) is that you can write the whole NN as a sum of products, and it is easy to see that each weight of layer 0 appears in 1 term, of layer 1 in 30 terms, of layer 2 in 3000 terms, and so on. Therefore a small change of a weight in the higher layers changes the output more.
- 01:35:48 @emid6811: Does anyone know the paper about "analyzing infinitely linear layers" that Andrej mentioned in the video?
- 01:35:59 @navalgupta5807: One doubt regarding the condition p.dim==2: I don't understand why this was done and which parameters it will filter out.
- 01:36:20 @wolpumba4099: Parameter weights: the distribution and scale should be monitored for anomalies and asymmetries.
- 01:38:41 @zlsj861: The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
- 01:38:45 @Koyaanisqatsi2000: "That's problematic because in that simple stochastic gradient setup you would be training this last layer 10x faster with respect to the other layers." Why 10x faster?
- 01:39:56 @wolpumba4099: The update:data ratio should be around -3 on a log scale, indicating a good learning rate and balanced parameter updates.
- 01:40:25 @TheAIEpiphany: Did you try using the log L2 norm ratio here instead of std? You're using variance as a proxy for how big the updates are w.r.t. the data values.
- 01:40:38 @vks43523: Can someone explain why we divide the std of the gradient by the std of the data instead of using the mean? The weight update ratio = grad*learning_rate/weight_value. As we have multiple inputs and multiple entries in the batch, we could take the mean to calculate a single value; I cannot figure out how std is a better option.
- 01:40:46 @zxynj: Why do we use the standard deviation to calculate the update-to-data ratio?
- 01:40:49 @jonathanhampton6055: Why stddev here? Wouldn't we want to use something like the L1-norm? Also, wouldn't we want to log this metric before updating the parameters?
- 01:43:42 @zlsj861: Monitoring the update ratio for the parameters to ensure efficient training, aiming for -3 on the log plot.
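The update:data ratio being debated above, as a minimal sketch for a single parameter tensor (the toy loss and sizes are mine; -3 on the log10 scale means updates about 1/1000 the size of the data):

import torch

lr = 0.1
p = torch.randn(200, 27, requires_grad=True)        # stand-in for one parameter tensor
loss = (p ** 2).mean()
loss.backward()

with torch.no_grad():
    update = -lr * p.grad
    ratio = (update.std() / p.data.std()).log10()    # the quantity tracked per step; aim for roughly -3
    p += update
print(ratio.item())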
- 01:52:04 @zlsj861: Introduced batch normalization and PyTorch modules for neural networks.
- 01:53:06 @zlsj861: Introduction to diagnostic tools for neural network analysis.
- 01:55:50 @zlsj861: Introduction to the diagnostic tools in neural networks; initialization and backpropagation remain areas of active research and ongoing progress.
