
introduction

- Introduction

pretraining data (internet)

- LLM Pre-training

Around , you explain a really interesting notion: that models need to "think" before producing a complex response, because each layer in a neural network performs only a finite amount of computation. I feel like it's somewhat related to the notion of computational irreducibility that Stephen Wolfram talks about. This is also why we humans need to spend some time thinking about complex issues before coming up with a good response.

But what if the ultimate joke about pelicans is actually 'the the the the the the,' and we simply don't have enough intelligence to understand it, just like an unusual move in the game of Go? XD

Wow, amazing, so much covered in a few hours. Saved me hours of research and inspired me to do more. Great work; looking forward to more interesting videos like this.

At , he talks about eliminating racist sites during corpus preprocessing. This can introduce bias by eliminating candid discussion of, for example, average IQ test scores of racial subgroups. Claude refuses to answer this altogether, calling race a constructed concept. ChatGPT and Gemini, at the time I queried them, both produced valid, honest outputs, which aligned with the research. Those of you so enamored with Claude are still trapped in Dario's echo chamber. But society has moved on now (2025). Will you?

tokenization

neural network I/O

- Neural Net & Training

neural network internals

inference

GPT-2: training and inference

Somewhere around , you said something about training on 1 million tokens. Do you mean you train on chunks of 1 million tokens at a time, or on batches of shorter sequences whose tokens add up to a million?
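For what it's worth, in standard GPT-2-style pretraining the ~1 million figure usually means the total number of tokens processed per optimization step (batch size × context length), not one contiguous million-token chunk. A minimal sketch of how such a batch is formed, with all numbers hypothetical and not taken from the video:

```python
import numpy as np

# Stand-in for a tokenized corpus: one long stream of token ids.
tokens = np.random.randint(0, 50257, size=10_000_000)

context_length = 1024   # GPT-2's context window
batch_size = 1024       # 1024 sequences per step -> ~1M tokens per optimization step

def get_batch(tokens, batch_size, context_length):
    # Sample random offsets and slice out (input, next-token target) windows.
    ix = np.random.randint(0, len(tokens) - context_length - 1, size=batch_size)
    x = np.stack([tokens[i : i + context_length] for i in ix])
    y = np.stack([tokens[i + 1 : i + 1 + context_length] for i in ix])
    return x, y

x, y = get_batch(tokens, batch_size, context_length)
print(x.shape)  # (1024, 1024): about a million tokens consumed in one step
```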

- GPUs & Model Costs

Llama 3.1 base model inference

Parallel universes!!! Just loving these analogies. Awesome!

pretraining to post-training

post-training data (conversations)

- Build LLM Assistant

"something went wrong" 😂 lol I love that he left this in there!

His genuine laugh at the ChatGPT error is so pure and spontaneous. How can someone not love Karpathy!? Sir, you are pure gold for humanity.

hallucinations, tool use, knowledge/working memory

The chapter about hallucinations was so insightful. I'd never heard it framed as a dataset issue, i.e., the model simply wasn't trained to say "I don't know", nor had I seen how one can test the knowledge of the model. Thanks!

Observation: at approximately , Andrej tests the question "Who is Orson Kovacs?" using falcon-7b-instruct in the HF playground, and the temperature is still 1.0, which makes the model respond with a balance between randomness and determinism. It still makes things up (hallucinates) there, but it would be good to also test with temperature below and above 1.0 to see how the factuality of the output varies.
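For readers curious what temperature actually does: a minimal sketch (not code from the video) of temperature-scaled sampling, where logits are divided by T before the softmax, so T < 1 sharpens the distribution (more deterministic) and T > 1 flattens it (more random):

```python
import numpy as np

def sample(logits, temperature=1.0, rng=np.random.default_rng(0)):
    # Divide logits by T, then apply a numerically stable softmax.
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
for T in (0.2, 1.0, 2.0):
    counts = np.bincount([sample(logits, T) for _ in range(1000)], minlength=4)
    print(T, counts / 1000)  # low T concentrates mass on the top token
```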

Regarding what you mentioned around the mark: is the reason you allow the model to say "I don't know", instead of augmenting it with the new knowledge, that there's an infinite amount of knowledge to learn, so it's virtually impossible to learn it all, and thus better to train the model to know when to refuse? In other words, if the model somehow COULD learn ALL the knowledge in the world, would we no longer need to train it to stop hallucinating? Thanks.

Thanks for the informative video! I have a question about training language models for tool use, specifically regarding the process you described around

knowledge of self

models need tokens to think

@. Question: I was just reading a paper recently (I believe it was from Anthropic, but sadly I can't find it now) which found that when they looked at "thinking models", the final answer was generally already determined well before the reasoning process began; the model then just fills in the chain of thought to get from the question to where it wants to go. Isn't this exactly what you said is not the correct way to handle this? Can you comment on why, if this is the "wrong" approach, it seems to be what modern models are doing?

@ that is elucidating! This is the first time I’ve heard of this concept. Thank you Andrej.

This teacher is very good at giving cute examples. I appreciate it, and I agree.

tokenization revisited: models struggle with spelling

Wow, I love this explanation of why these models fail at character-related and counting-related tasks.
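To see the failure mode concretely, here is a minimal sketch using the tiktoken library (chosen here for illustration; not necessarily the tokenizer shown in the video). The model receives subword ids, not letters, so spelling and counting require reasoning across opaque chunks:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("strawberry")
print(ids)                             # a handful of subword token ids
print([enc.decode([i]) for i in ids])  # the chunks the model actually "sees"
# Counting the r's in "strawberry" means reasoning over these chunks,
# since individual characters are never directly visible to the model.
```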

jagged intelligence

supervised finetuning to reinforcement learning

- Model Training in Practice

reinforcement learning

DeepSeek-R1

DeepSeek says “$3 is a bit expensive for an apple, but maybe they’re organic or something” 😂

What a treat!!! At , when you say it's "very busy, very ugly" because Google wasn't able to nail it, that was epic, haha.

AlphaGo

Thank you for the video Andrej! One small note: at , the dashed line in the AlphaGo Zero plot is the Elo of the version of AlphaGo that *defeated* Lee in 2016 (not the Elo of Lee himself).

reinforcement learning from human feedback (RLHF)

Tiny typo: "let's add it to the dataset and give it an ordering that's extremely like a score of 5" -> should be "let's add it to the dataset and give it an ordering that's extremely like a score of 1".

preview of things to come

keeping track of LLMs

If you have come this far, then finish the video and go build something with LLMs. 😊

where to find LLMs

grand summary

In principle, these models are capable of analogies no human has ever had. Wow 😮
