3Blue1Brown

Note: The channel and video information shown on this site is retrieved and displayed using YouTube's official API. Videos are played in the official YouTube player, so all view counts and revenue go to the original videos.

Video Timetable

Videos: 214

- Where facts in LLMs live - How might LLMs store facts | DL7

- Where facts in LLMs live

How might LLMs store facts | DL7
2024年08月31日 
00:00:00 - 00:02:15
Starts at - How might LLMs store facts | DL7

Starts at

How might LLMs store facts | DL7
2024年08月31日  @donson3326 様 
00:00:01 - 00:22:43
Wait we don't actually know how it works fully? - How might LLMs store facts | DL7

Wait we don't actually know how it works fully?

How might LLMs store facts | DL7
2024年08月31日  @chunlingjohnnyliu2889 様 
00:00:45 - 00:22:43
Hold on, they build the LLM and don't know how the facts are cataloged? This is gonna be a doozy.

How might LLMs store facts | DL7
2024年08月31日  @donewithprecision785 様 
00:00:56 - 00:22:43
- Quick refresher on transformers - How might LLMs store facts | DL7

- Quick refresher on transformers

How might LLMs store facts | DL7
2024年08月31日 
00:02:15 - 00:04:39
My understanding is that an LLM does not in fact store facts; rather, by predicting word associations across an astoundingly large set of training examples, it "stores" the likelihood that the word "basketball" is historically the most likely next word in the series. It doesn't have any grasp of the concept of basketball in any meaningful or even static way. This is exactly the problem I'm trying to solve, and honestly I think I found a solution. I just don't know yet how reliable it is on a large scale, or how economical it is in terms of the required computing power. We'll see.

How might LLMs store facts | DL7
2024年08月31日  @MrRavaging 様 
00:02:15 - 00:22:43
to - How might LLMs store facts | DL7

to

How might LLMs store facts | DL7
2024年08月31日  @baransam1 様 
00:02:35 - 00:02:45
what would be simplified is half a circumference with two half circumferences inside it, into infinity... that would show some precision...as to how the machine AI is doing this efficiently...while increasing the accuracy of predictions...  the selection line on the graph..sequence of vectors lines...attention lines - How might LLMs store facts | DL7

what would be simplified is half a circumference with two half circumferences inside it, into infinity... that would show some precision...as to how the machine AI is doing this efficiently...while increasing the accuracy of predictions... the selection line on the graph..sequence of vectors lines...attention lines

How might LLMs store facts | DL7
2024年08月31日  @LordWarden170 様 
00:02:44 - 00:22:43
The tokens (words) convey context information to each other, making the embedding a richer/more nuanced version than the simple meaning of the word. When this animation is shown, the arrows are shown moving from a later token to an earlier token as well. Isn't this contradictory to the concept introduced during masking, where it is said that only earlier words are allowed to enrich the later words? (This is a common animation shown multiple times in this series.)

How might LLMs store facts | DL7
2024年08月31日  @baransam1 様 
00:02:45 - 00:22:43
Only the joker would pick stranger over stronger - How might LLMs store facts | DL7

Only the joker would pick stranger over stronger

How might LLMs store facts | DL7
2024年08月31日  @robinmitchell6803 様 
00:03:16 - 00:22:43
Live in a "high dimension" Please expand. - How might LLMs store facts | DL7

Live in a "high dimension" Please expand.

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:03:38 - 00:22:43
I was unable to reproduce woman-man ~= aunt-uncle using either OpenAIEmbedding model 'text-embedding-3-small' or the older 'text-embedding-ada-002' model using LangChain. Cosine similarity of 0.29. I tried lots of pairings: aunt-uncle, woman-man, sister-brother, and queen-king. All had cosine similarities in the range 0.29 to 0.38. Happy to share my work if you're curious. - How might LLMs store facts | DL7

I was unable to reproduce woman-man ~= aunt-uncle using either OpenAIEmbedding model 'text-embedding-3-small' or the older 'text-embedding-ada-002' model using LangChain. Cosine similarity of 0.29. I tried lots of pairings: aunt-uncle, woman-man, sister-brother, and queen-king. All had cosine similarities in the range 0.29 to 0.38. Happy to share my work if you're curious.

How might LLMs store facts | DL7
2024年08月31日  @davidmorse5411 様 
00:03:41 - 00:22:43
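The comparison described in the comment above can be sketched in a few lines of numpy; the 4-dimensional vectors below are hypothetical stand-ins rather than real model weights, and the video's claim concerns the learned token-embedding matrix inside a transformer, which an off-the-shelf sentence-embedding API may not reproduce.

    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical 4-dimensional stand-ins for real word embeddings.
    emb = {
        "man":   np.array([0.2, 0.1, 0.9, 0.0]),
        "woman": np.array([0.2, 0.8, 0.9, 0.0]),
        "uncle": np.array([0.5, 0.1, 0.3, 0.7]),
        "aunt":  np.array([0.5, 0.8, 0.3, 0.7]),
    }

    # The claim being tested: (woman - man) points roughly the same way as (aunt - uncle).
    print(cosine(emb["woman"] - emb["man"], emb["aunt"] - emb["uncle"]))  # 1.0 for these toy vectors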
this subtraction of vector makes me wonder if all of category theory can be described using linear algebra - How might LLMs store facts | DL7

this subtraction of vector makes me wonder if all of category theory can be described using linear algebra

How might LLMs store facts | DL7
2024年08月31日  @alejrandom6592 様 
00:03:41 - 00:22:43
If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (a vector 'meaning' man, woman, uncle, or aunt, as per the example at ). - How might LLMs store facts | DL7

If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (a vector 'meaning' man, woman, uncle, or aunt, as per the example at ).

How might LLMs store facts | DL7
2024年08月31日  @daantromp5195 様 
00:03:44 - 00:22:43
Kinda like how neural synapses work...when neurons wire together, they fire together and then it tickles one of the adjacent "dormant" neurons and it lights up with a memory like, "Oh yeah! Totally forgot about that until you just mentioned it again to me...." right? - How might LLMs store facts | DL7

Kinda like how neural synapses work...when neurons wire together, they fire together and then it tickles one of the adjacent "dormant" neurons and it lights up with a memory like, "Oh yeah! Totally forgot about that until you just mentioned it again to me...." right?

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:03:59 - 00:22:43
I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ? - How might LLMs store facts | DL7

I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?

How might LLMs store facts | DL7
2024年08月31日  @timeflex 様 
00:04:10 - 00:22:43
- Assumptions for our toy example - How might LLMs store facts | DL7

- Assumptions for our toy example

How might LLMs store facts | DL7
2024年08月31日 
00:04:39 - 00:06:07
This is unironically how I understand "the spectrum" of autism, for example. - How might LLMs store facts | DL7

This is unironically how I understand "the spectrum" of autism, for example.

How might LLMs store facts | DL7
2024年08月31日  @KillianTwew 様 
00:05:08 - 00:22:43
Are you implying that the vectors are not normalized, and therefore a dot product of 1 does not mean they are parallel? So what we call semantic similarity is not a measure of pointing in the same direction? So the dot product can be 1 in several directions at the same time?

How might LLMs store facts | DL7
2024年08月31日  @enriquebalpstraffon 様 
00:05:10 - 00:22:43
I don't understand the assumptions made here about dot products. Why is a dot product of 1 used to mean that the vector encodes that particular direction/concept? I would have thought that the vector needs to be parallel to that concept vector to assume it encodes that concept. But then a vector would only be able to encode one concept. Is this why dot product = 1 is just sort of conventionally chosen?

How might LLMs store facts | DL7
2024年08月31日  @gauravfotedar 様 
00:05:30 - 00:22:43
How can dot product of a vector with both "Michael" and "Jordan" be 1 when earlier it was said that "Michael" and "Jordan" are nearly orthogonal to each other? - How might LLMs store facts | DL7

How can dot product of a vector with both "Michael" and "Jordan" be 1 when earlier it was said that "Michael" and "Jordan" are nearly orthogonal to each other?

How might LLMs store facts | DL7
2024年08月31日  @vivekrai1974 様 
00:05:53 - 00:22:43
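The two questions above (why a dot product of 1 is read as "this vector encodes that direction", and how one vector can score 1 against two nearly orthogonal directions at once) can be illustrated with a toy calculation; the three-dimensional vectors below are made up purely for illustration.

    import numpy as np

    # Two orthogonal unit directions, standing in for "first name Michael" and "last name Jordan".
    michael = np.array([1.0, 0.0, 0.0])
    jordan  = np.array([0.0, 1.0, 0.0])
    print(michael @ jordan)      # 0.0 - the directions themselves are orthogonal

    # An embedding can encode both simply by being (close to) their sum.
    embedding = michael + jordan
    print(embedding @ michael)   # 1.0
    print(embedding @ jordan)    # 1.0 - dot product 1 with both, without being parallel to either

The dot product with a unit direction only measures the component along that direction, so a single vector can have component 1 along many different near-orthogonal directions at the same time.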
- Inside a multilayer perceptron - How might LLMs store facts | DL7

- Inside a multilayer perceptron

How might LLMs store facts | DL7
2024年08月31日 
00:06:07 - 00:15:38
Does that sequence of high-dimension vectors (let’s call it a 1D array) in the MLP behave as its own tensor in the LLM? - How might LLMs store facts | DL7

Does that sequence of high-dimension vectors (let’s call it a 1D array) in the MLP behave as its own tensor in the LLM?

How might LLMs store facts | DL7
2024年08月31日  @mistahtom 様 
00:07:00 - 00:22:43
You are telling me that the AI Is akinator - How might LLMs store facts | DL7

You are telling me that the AI Is akinator

How might LLMs store facts | DL7
2024年08月31日  @joaquincurrais4856 様 
00:08:47 - 00:22:43
Who determines "bias" or is it a "vector" with a "code" as well? - How might LLMs store facts | DL7

Who determines "bias" or is it a "vector" with a "code" as well?

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:09:33 - 00:22:43
) otherwise all borders go through - How might LLMs store facts | DL7

) otherwise all borders go through

How might LLMs store facts | DL7
2024年08月31日  @anti-troll-software6151 様 
00:10:30 - 00:22:43
, "...continuing with the deep learning tradition of overly fancy names..." 😂🤣😂 - How might LLMs store facts | DL7

, "...continuing with the deep learning tradition of overly fancy names..." 😂🤣😂

How might LLMs store facts | DL7
2024年08月31日  @baxile 様 
00:10:49 - 00:22:43
So this is just an "if, then" function? - How might LLMs store facts | DL7

So this is just an "if, then" function?

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:11:33 - 00:22:43
The bias exists to move the border between yes and no (see 0. It is literally the b in y = mx + b. Without it, all lines y = mx go through (0,0)

How might LLMs store facts | DL7
2024年08月31日  @anti-troll-software6151 様 
00:13:44 - 00:10:30
So the weights are simultaneously nudged to form vector encodings for output words as columns, and patterns in rows to get the values of how much each column should be used based on multiplication by input? - How might LLMs store facts | DL7

So the weights are simultaneously nudged to form vector encodings for output words as columns, and patterns in rows to get the values of how much each column should be used based on multiplication by input?

How might LLMs store facts | DL7
2024年08月31日  @Valentin-d1j 様 
00:13:53 - 00:22:43
As for the question of what the bias does: it's just a control for the height at which you put the threshold of the ReLU. This way you can clip the data at different values depending on the context.

How might LLMs store facts | DL7
2024年08月31日  @bzqp2 様 
00:14:00 - 00:22:43
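A small numeric sketch of the point made in the comments above: the bias only shifts where the ReLU's "if positive, pass it through, otherwise zero" cutoff sits, playing the same role as the b in y = mx + b. The weights and inputs below are arbitrary.

    import numpy as np

    def relu(x):
        # If the input is positive, keep it; otherwise output 0.
        return np.maximum(0.0, x)

    w = np.array([1.0, 1.0])     # one row of the up-projection matrix (made-up values)
    x = np.array([0.6, 0.3])     # w @ x = 0.9

    print(relu(w @ x + 0.0))     # 0.9 - with no bias, any positive dot product fires
    print(relu(w @ x - 1.0))     # 0.0 - a bias of -1 raises the threshold: w @ x must exceed 1 to fire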
At  I think this might be a misinterpretation. The MLP block uses the same 50,000 neurons for all tokens in the sequence and not 50,000 neurons per token. @3blue1brown is that correct? - How might LLMs store facts | DL7

At I think this might be a misinterpretation. The MLP block uses the same 50,000 neurons for all tokens in the sequence and not 50,000 neurons per token. @3blue1brown is that correct?

How might LLMs store facts | DL7
2024年08月31日  @alienhunter4870 様 
00:14:40 - 00:22:43
I'm wondering if the phrasing here is a bit misleading. Unless I'm missing something, the block has 50,000 neurons but the sequence of tokens is passed through it, meaning you get the number of activations multiplied by the number of tokens, not neurons per se. This part might lead someone to think that those neurons are different for each token, but they are not; only the activations are. Regardless, this is an excellent video.

How might LLMs store facts | DL7
2024年08月31日  @marksverdhei 様 
00:14:40 - 00:22:43
Are "bias" parts of speech "adjectives" and "adverbs?" - How might LLMs store facts | DL7

Are "bias" parts of speech "adjectives" and "adverbs?"

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:14:50 - 00:22:43
the one piece is real - How might LLMs store facts | DL7

the one piece is real

How might LLMs store facts | DL7
2024年08月31日  @samuelgunter 様 
00:15:10 - 00:22:43
ITS REAL - How might LLMs store facts | DL7

ITS REAL

How might LLMs store facts | DL7
2024年08月31日  @kylewood4001 様 
00:15:11 - 00:22:43
- Counting parameters - How might LLMs store facts | DL7

- Counting parameters

How might LLMs store facts | DL7
2024年08月31日 
00:15:38 - 00:17:04
at  the parameters for the FF network are counted. Are these the parameters for the FF network of 1 token? If so, does this mean that the total number of parameters, including shared parameters, is much higher? - How might LLMs store facts | DL7

at the parameters for the FF network are counted. Are these the parameters for the FF network of 1 token? If so, does this mean that the total number of parameters, including shared parameters, is much higher?

How might LLMs store facts | DL7
2024年08月31日  @thomasv92 様 
00:16:08 - 00:22:43
Great video! In case anybody is wondering how to count parameters of the Llama models, use the same math as in  but keep in mind that Llama has a third projection in its MLP, the 'Gate-projection', of the same size as the Up- or Down-projections. - How might LLMs store facts | DL7

Great video! In case anybody is wondering how to count parameters of the Llama models, use the same math as in but keep in mind that Llama has a third projection in its MLP, the 'Gate-projection', of the same size as the Up- or Down-projections.

How might LLMs store facts | DL7
2024年08月31日  @zw2249 様 
00:16:47 - 00:22:43
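The arithmetic behind the parameter count discussed here, using GPT-3's published sizes (embedding dimension 12,288, MLP hidden width four times that, 96 layers), plus a hedged Llama-style count that assumes the usual gate/up/down projections without bias terms.

    d_model = 12288             # GPT-3 embedding dimension
    d_hidden = 4 * d_model      # 49,152 neurons in the middle layer of each MLP block
    n_layers = 96

    # GPT-3-style MLP: up-projection with bias, then down-projection with bias.
    per_block = (d_model * d_hidden + d_hidden) + (d_hidden * d_model + d_model)
    print(per_block * n_layers)   # roughly 116 billion MLP parameters

    # Llama-style MLP (assumption: gate, up and down projections, no biases).
    d, h = 4096, 11008            # example sizes in the spirit of a 7B-parameter Llama model
    print(3 * d * h)              # MLP parameters per block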
- Superposition - How might LLMs store facts | DL7

- Superposition

How might LLMs store facts | DL7
2024年08月31日 
00:17:04 - 00:21:37
the superposition chapter is great... Watch it guys n girls - How might LLMs store facts | DL7

the superposition chapter is great... Watch it guys n girls

How might LLMs store facts | DL7
2024年08月31日  @FreakAzoiyd 様 
00:17:04 - 00:22:43
@ Re: Superposition - dimensions not being completely independent but rather related. Here's a way to understand superposition that imo was not really clear in the video.

How might LLMs store facts | DL7
2024年08月31日  @stevenlynch3456 様 
00:17:45 - 00:22:43
Where could someone find the source material or "footnotes/bibliography" found for each LLM's main base for facts and standardized information deemed "valid" by independent accredited main international sources or bodies of information? - How might LLMs store facts | DL7

Where could someone find the source material or "footnotes/bibliography" found for each LLM's main base for facts and standardized information deemed "valid" by independent accredited main international sources or bodies of information?

How might LLMs store facts | DL7
2024年08月31日  @TheSpiritualCollective444 様 
00:17:45 - 00:22:43
Is there a way to estimate the amount of additional “dimensions” you get by having 89-91 degrees versus 90 degrees - How might LLMs store facts | DL7

Is there a way to estimate the amount of additional “dimensions” you get by having 89-91 degrees versus 90 degrees

How might LLMs store facts | DL7
2024年08月31日  @carlinw 様 
00:18:05 - 00:22:43
this part is really cool! - How might LLMs store facts | DL7

this part is really cool!

How might LLMs store facts | DL7
2024年08月31日  @johnchessant3012 様 
00:18:28 - 00:22:43
.. such that all the vectors are orthogonal is illuminating (~). It suggests that the surface of the n-dim sphere is being partitioned into a vast quantity of locally flat 'Gaussians' (central limit ;-) of similarity directions. Once you have that, plus the layer depth to discriminate conceptual level, one gets to see how it works, though it doesn't have any explanatory capability, because its vocabulary (numeric vectors) does not bake in the human explanatory phrasings we use (all very 'physician heal thyself' given it's an LLM!)

How might LLMs store facts | DL7
2024年08月31日  @philipoakley5498 様 
00:18:50 - 00:22:43
can't i make it in JavaScript? ^^ - How might LLMs store facts | DL7

can't i make it in JavaScript? ^^

How might LLMs store facts | DL7
2024年08月31日  @Melkanea 様 
00:19:05 - 00:22:43
Important correction: There's an error in the scrappy code I was demoing around , such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, it appears not to be possible to get 100k vectors in 100d to be as "nearly orthogonal" as this.  100 dimensions seems to be too low, at least for the (89°, 91°), for the Johnson-Lindenstrauss lemma to really kick in. - How might LLMs store facts | DL7

Important correction: There's an error in the scrappy code I was demoing around , such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, it appears not to be possible to get 100k vectors in 100d to be as "nearly orthogonal" as this. 100 dimensions seems to be too low, at least for the (89°, 91°), for the Johnson-Lindenstrauss lemma to really kick in.

How might LLMs store facts | DL7
2024年08月31日  @3blue1brown 様 
00:19:50 - 00:22:43
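A quick experiment (not the original demo code) that puts numbers on this correction: random unit vectors in 100 dimensions are already close to orthogonal, with pairwise dot products spread around 0 with standard deviation roughly 1/sqrt(100) = 0.1, i.e. angles a few degrees either side of 90°; squeezing 100k of them strictly into the (89°, 91°) window is the much harder ask.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 100, 1000
    v = rng.normal(size=(n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)          # n random unit vectors in d dimensions

    dots = (v @ v.T)[np.triu_indices(n, k=1)]              # all pairwise dot products
    angles = np.degrees(np.arccos(np.clip(dots, -1, 1)))

    print(angles.mean(), angles.std())                     # mean ~90 degrees, spread ~5-6 degrees
    print((np.abs(dots) < np.cos(np.radians(89))).mean())  # fraction already inside (89, 91) degrees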
A bit unclear how the addition of noise to "vectors perpendicularity" can create space for additional features... can somebody help me understand that?

How might LLMs store facts | DL7
2024年08月31日  @tempdeltavalue 様 
00:19:54 - 00:22:43
Another way to imagine it is shooting an arrow in space, and shooting a second arrow in a 0.001° different direction. The first inch is nothing, nor is the first 20. But as it goes feet and miles out, it'll eventually be so far apart that it's hard to believe they came from the same bow. Also chaotic pendulums, such as a pendulum on the end of a pendulum: slight changes end up with completely different movement.

How might LLMs store facts | DL7
2024年08月31日  @rmt3589 様 
00:20:18 - 00:22:43
More & more it feels as if current networks are mainly our first bookshelves.

How might LLMs store facts | DL7
2024年08月31日  @Melkanea 様 
00:20:30 - 00:22:43
times as many independent ideas."  💥 - How might LLMs store facts | DL7

times as many independent ideas." 💥

How might LLMs store facts | DL7
2024年08月31日  @FilippoVitaleIT 様 
00:20:31 - 00:22:43
- Bell Laboratories. I am currently interning at Bell Laboratories :) A fun fact: Yann LeCun created CNNs as part of his internship at Bell Labs.

How might LLMs store facts | DL7
2024年08月31日  @PramodhRachuri 様 
00:21:10 - 00:22:43
This reminds me of bloom filters - How might LLMs store facts | DL7

This reminds me of bloom filters

How might LLMs store facts | DL7
2024年08月31日  @alicederyn 様 
00:21:10 - 00:22:43
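For readers who haven't met the Bloom filters mentioned in this comment: a toy version, included only to make the analogy concrete (many items hashed into one shared bit array, much as many features share the same dimensions under superposition). This is an illustration, not anything from the video.

    import hashlib

    class BloomFilter:
        def __init__(self, size=64, hashes=3):
            self.size, self.hashes, self.bits = size, hashes, 0

        def _positions(self, item):
            # Derive several bit positions per item from a hash.
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits |= 1 << p

        def __contains__(self, item):
            # Can return false positives, never false negatives.
            return all(self.bits & (1 << p) for p in self._positions(item))

    bf = BloomFilter()
    bf.add("Michael Jordan")
    print("Michael Jordan" in bf, "Michael Phelps" in bf)   # True, almost certainly False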
hey ya'll! 🇺🇿 - How might LLMs store facts | DL7

hey ya'll! 🇺🇿

How might LLMs store facts | DL7
2024年08月31日  @yapsdotgg 様 
00:21:18 - 00:22:43
it seems obvious to me that a superposition would store more data, not because of nearly perpendicular vectors, but because you're effectively moving from a unary system to a higher base. Same reason you can count to 10, or to 1023, on the same amount of fingers - How might LLMs store facts | DL7

it seems obvious to me that a superposition would store more data, not because of nearly perpendicular vectors, but because you're effectively moving from a unary system to a higher base. Same reason you can count to 10, or to 1023, on the same amount of fingers

How might LLMs store facts | DL7
2024年08月31日  @rlrfproductions 様 
00:21:20 - 00:22:43
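The finger-counting aside is just a change of positional base; assuming each of ten fingers stands for one binary digit, the arithmetic checks out:

    fingers = 10
    print(fingers)            # 10: the highest count in "unary", one raised finger per unit
    print(2**fingers - 1)     # 1023: the highest count if each finger is instead a binary digit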
Skip to there to understand the issue. Training GPT-4 in 2022 should have taken around a cool thousand years. Then Huang says something silly: he says "Well, they used a stack of 8000 H100 GPUs and it only took three months" - forgetting that the H100 was only on the drawing board back in 2022 when GPT-4 was trained. Now read a little about the latest discoveries in brain science, and I mean especially focus on N400 and P600. And you tell me how to explain Dan, Rob, Max and Dennis. I'm gonna leave this up to you, as I'm sure you understand what I'm getting at.

How might LLMs store facts | DL7
2024年08月31日  @nyyotam4057 様 
00:21:33 - 00:22:43
- Up next - How might LLMs store facts | DL7

- Up next

How might LLMs store facts | DL7
2024年08月31日 
00:21:37 - 00:22:43
Also important in the training process is the concept of self-supervised learning, to harness the mass of unlabelled data (books, in NLP).

How might LLMs store facts | DL7
2024年08月31日  @hugob8180 様 
00:21:45 - 00:22:43
Holograms coming! :D - How might LLMs store facts | DL7

Holograms coming! :D

How might LLMs store facts | DL7
2024年08月31日  @PewrityLab 様 
00:22:10 - 00:22:43
Uhhh, holograms, I'm so excited, and that is on top of an excellent video. I'm amazed how you manage to consistently keep such a high standard :D

How might LLMs store facts | DL7
2024年08月31日  @AkantorJojo 様 
00:22:15 - 00:22:43
I may have learnt something new here. Grant is saying that the skip connection in the MLP is actually enabling the transformation of the original vector into another vector with enriched contextual meaning. Specifically, he is saying that via the summation of the skip connection, the MLP has somehow learnt the directional vector to be added onto the original vector "Michael Jordan", to produce a new output vector that adds "basketball" information. I was originally of the impression that skip connections are only there to combat vanishing gradients and expedite learning. But now Grant is emphasizing it is doing much more!

How might LLMs store facts | DL7
2024年08月31日  @kwew1 様 
00:22:42 - 00:22:43
- End of Harriet Nembhard's introduction - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- End of Harriet Nembhard's introduction

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:00:00 - 00:00:45
- The cliché - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- The cliché

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:00:45 - 00:02:28
Putting a face with a voice: first, it's not at all what I imagined when listening to/watching your videos! Second, from here on out when watching, I will always see 3B1B narrating his videos in a cap and gown!

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @Dr_Larken 様 
00:01:11 - 00:15:30
Following your dreams requires more than just passion. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Following your dreams requires more than just passion.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:01:57 - 00:03:54
At ... I love how, after Grant points out the 'nerdiness' of the audience and continues to (jokingly) say "... in the vector space of all possible advice", the camera centers on two lovely nerds sitting, without batting an eyelid, thinking: "yeees? ...".

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @zethodderskov 様 
00:02:18 - 00:15:30
- The shifting goal - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- The shifting goal

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:02:28 - 00:05:57
Following your dreams requires pragmatic concerns beyond inspiration. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Following your dreams requires pragmatic concerns beyond inspiration.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:03:54 - 00:05:51
a guy in the audience sighs after hearing that the goal of their life changes today. You can see the stress growing in the faces of the audience. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

a guy in the audience sighs after hearing that the goal of their life changes today. You can see the stress growing in the faces of the audience.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @asifasmatnibir 様 
00:03:55 - 00:15:30
Transition from personal growth to adding value to others - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Transition from personal growth to adding value to others

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:05:51 - 00:07:48
- Action precedes motivation - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- Action precedes motivation

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:05:57 - 00:07:02
wow. Well that hits hard. I hope I'll remember it. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

wow. Well that hits hard. I hope I'll remember it.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @robinmc142 様 
00:06:21 - 00:15:30
- Timing - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- Timing

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:07:02 - 00:10:47
Action precedes motivation in finding a career you love - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Action precedes motivation in finding a career you love

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:07:48 - 00:09:45
Survivorship bias affects the advice of pursuing high-risk, high-reward paths. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Survivorship bias affects the advice of pursuing high-risk, high-reward paths.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:09:45 - 00:11:42
- Know your influence - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- Know your influence

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:10:47 - 00:12:05
"Cut Defence Ties". Glad to see the Student Body Politic has some positive values... 😀 And a cool idea, btw, I've not seen that in the UK, a slogan top of the mortar-board. But speaking with a tassle??? I'm not sure they do that here, either... 🤔 - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

"Cut Defence Ties". Glad to see the Student Body Politic has some positive values... 😀 And a cool idea, btw, I've not seen that in the UK, a slogan top of the mortar-board. But speaking with a tassle??? I'm not sure they do that here, either... 🤔

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @lawrence18uk 様 
00:11:02 - 00:15:30
Success is a function of the value you bring to others - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Success is a function of the value you bring to others

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:11:42 - 00:13:39
lol my calculus teacher told me I should consider a double major or at least a minor in math. I took his advice as a compliment, and completely ignored it. Now I am unemployed training to become a data scientist! Trust those who have more experience than you! - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

lol my calculus teacher told me I should consider a double major or at least a minor in math. I took his advice as a compliment, and completely ignored it. Now I am unemployed training to become a data scientist! Trust those who have more experience than you!

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @2AitchSquared 様 
00:11:44 - 00:15:30
that's such an amazing mentality. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

that's such an amazing mentality.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @robinmc142 様 
00:11:58 - 00:15:30
- Anticipate change - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

- Anticipate change

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日 
00:12:05 - 00:15:30
"CUT DEFENSE TIES" in the audience hell yeah - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

"CUT DEFENSE TIES" in the audience hell yeah

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @anarcho-yorpism 様 
00:12:16 - 00:15:30
Influence the dreams of others and be adaptable to change - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Influence the dreams of others and be adaptable to change

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @HarryPotter-nj3mf 様 
00:13:39 - 00:15:30
This is the best commencement speech I've ever heard. Also, the audience reaction at  cracked me up. - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

This is the best commencement speech I've ever heard. Also, the audience reaction at cracked me up.

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @johnchessant3012 様 
00:13:43 - 00:15:30
, like me so that I can listen this part again - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

, like me so that I can listen this part again

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @TheKhagendra 様 
00:14:12 - 00:15:30
"following not the dreams but the opportunities" - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

"following not the dreams but the opportunities"

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @superkaran20 様 
00:14:56 - 00:15:30
the dude who greets and makes you comfortable in unknown party - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

the dude who greets and makes you comfortable in unknown party

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @johnhammer8668 様 
00:15:17 - 00:15:30
Love the dude at  showing his enthusiasm for the class of 2024 - What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024

Love the dude at showing his enthusiasm for the class of 2024

What "Follow Your Dreams" Misses | Harvey Mudd Commencement Speech 2024
2024年05月18日  @salpicaomesquinho 様 
00:15:22 - 00:15:30
- Recap on embeddings - Attention in transformers, step-by-step | DL6

- Recap on embeddings

Attention in transformers, step-by-step | DL6
2024年04月07日 
00:00:00 - 00:01:39
*  - Transformers are key components of large language models, introduced in the 2017 paper "Attention is All You Need". - Attention in transformers, step-by-step | DL6

* - Transformers are key components of large language models, introduced in the 2017 paper "Attention is All You Need".

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:00:00 - 00:00:30
*🔍 Understanding the Attention Mechanism in Transformers*
- Introduction to the attention mechanism and its significance in large language models.
- Overview of the goal of transformer models to predict the next word in a piece of text.
- Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:00:00 - 00:02:11
- Transformers use attention mechanisms to process and associate tokens with semantic meaning. - Attention in transformers, step-by-step | DL6

- Transformers use attention mechanisms to process and associate tokens with semantic meaning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:00:02 - 00:02:38
Crafted by Merlin AI. Transformers use attention mechanisms to process and associate tokens with semantic meaning. - Attention in transformers, step-by-step | DL6

Crafted by Merlin AI. Transformers use attention mechanisms to process and associate tokens with semantic meaning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:00:02 - 00:02:38
*  - The model aims to predict the next word in a sequence by processing input text broken down into tokens (often words or parts of words). - Attention in transformers, step-by-step | DL6

* - The model aims to predict the next word in a sequence by processing input text broken down into tokens (often words or parts of words).

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:00:30 - 00:00:54
*  - Each token is associated with a high-dimensional vector called an embedding, where directions in this space correspond to semantic meaning. - Attention in transformers, step-by-step | DL6

* - Each token is associated with a high-dimensional vector called an embedding, where directions in this space correspond to semantic meaning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:00:54 - 00:01:28
It is in Sydney right now and I'm up late watching your video from my bed. I should probably get some sleep, I have morning classes; it's just that your content is too goddamned interesting. Plus, I'm a teenager. I can't be separated from my phone except by 16th-century French-style beheading. POST MORE VIDEOS! If I can't sleep you shouldn't get the luxury!

Attention in transformers, step-by-step | DL6
2024年04月07日  @jeremypianofreestyle7210 様 
00:01:00 - 00:26:10
*  - Transformers progressively adjust embeddings to encode rich contextual meaning beyond individual words. - Attention in transformers, step-by-step | DL6

* - Transformers progressively adjust embeddings to encode rich contextual meaning beyond individual words.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:01:28 - 00:01:40
- Motivating examples - Attention in transformers, step-by-step | DL6

- Motivating examples

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:01:39 - 00:04:29
*  - The attention mechanism can be challenging to grasp initially. - Attention in transformers, step-by-step | DL6

* - The attention mechanism can be challenging to grasp initially.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:01:40 - 00:02:00
*  - Examples like "mole" in different contexts highlight the need for context-aware embeddings, as the initial embedding is the same regardless of context. - Attention in transformers, step-by-step | DL6

* - Examples like "mole" in different contexts highlight the need for context-aware embeddings, as the initial embedding is the same regardless of context.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:02:00 - 00:02:57
*🧠 Contextual meaning refinement in Transformers*
- Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.
- Examples showcasing the updating of word embeddings based on context.
- Importance of attention blocks in enriching word embeddings with contextual information.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:02:11 - 00:05:37
IMP : @, @ - Attention in transformers, step-by-step | DL6

IMP : @, @

Attention in transformers, step-by-step | DL6
2024年04月07日  @INGLERAJKAMALRAJENDRA 様 
00:02:37 - 00:12:43
- Attention blocks refine word meanings based on context - Attention in transformers, step-by-step | DL6

- Attention blocks refine word meanings based on context

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:02:38 - 00:05:15
Attention blocks refine word meanings based on context - Attention in transformers, step-by-step | DL6

Attention blocks refine word meanings based on context

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:02:38 - 00:05:15
*  - Attention refines embeddings based on surrounding words; for instance, "tower" becomes more specific when preceded by "Eiffel". - Attention in transformers, step-by-step | DL6

* - Attention refines embeddings based on surrounding words; for instance, "tower" becomes more specific when preceded by "Eiffel".

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:02:57 - 00:04:29
Slightly disappointed you chose not to describe this update as moving the vector to be more "French-wards" - Attention in transformers, step-by-step | DL6

Slightly disappointed you chose not to describe this update as moving the vector to be more "French-wards"

Attention in transformers, step-by-step | DL6
2024年04月07日  @bosstowndynamics5488 様 
00:03:10 - 00:26:10
it's actually wrought iron not steel - Attention in transformers, step-by-step | DL6

it's actually wrought iron not steel

Attention in transformers, step-by-step | DL6
2024年04月07日  @danielkrajnik3817 様 
00:03:10 - 00:26:10
Thank you for the information given at , it cleared my doubt from the previous video - Attention in transformers, step-by-step | DL6

Thank you for the information given at , it cleared my doubt from the previous video

Attention in transformers, step-by-step | DL6
2024年04月07日  @Aarav-p5i3o 様 
00:04:26 - 00:26:10
*  - A simplified example with the phrase "a fluffy blue creature roamed the verdant forest" demonstrates how adjectives update nouns through attention. - Attention in transformers, step-by-step | DL6

* - A simplified example with the phrase "a fluffy blue creature roamed the verdant forest" demonstrates how adjectives update nouns through attention.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:04:29 - 00:04:56
- The attention pattern - Attention in transformers, step-by-step | DL6

- The attention pattern

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:04:29 - 00:11:08
*  - Each word's initial embedding encodes its meaning and position. - Attention in transformers, step-by-step | DL6

* - Each word's initial embedding encodes its meaning and position.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:04:56 - 00:06:14
Thank you very much for this very informative series on LLMs. I have a small question regarding the matrix dimensions though. @, we have that N_E = 12.288 is the embedding dimension. @6.40, we have that N_Q = 128 is the query embedding dimension; and so is N_K = 128, the key embedding dimension. So, @1. If the context size is N_C (= 2048 in GPT-3 as you indicate), then matrices Q = [Q_1 ... ] and K = [K_1 ...] each would have size N_Q x N_C. Whatever the size of N_C, the size of Q x K^t would be N_Q x N_Q, i.e., 128 x 128. But @ - Attention in transformers, step-by-step | DL6

Thank you very much for this very informative series on LLMs. I have a small question regarding the matrix dimensions though. @, we have that N_E = 12.288 is the embedding dimension. @6.40, we have that N_Q = 128 is the query embedding dimension; and so is N_K = 128, the key embedding dimension. So, @1. If the context size is N_C (= 2048 in GPT-3 as you indicate), then matrices Q = [Q_1 ... ] and K = [K_1 ...] each would have size N_Q x N_C. Whatever the size of N_C, the size of Q x K^t would be N_Q x N_Q, i.e., 128 x 128. But @

Attention in transformers, step-by-step | DL6
2024年04月07日  @kpremaratne 様 
00:05:01 - 00:06:50
- Transforming embeddings through matrix-vector products and tunable weights in deep learning. - Attention in transformers, step-by-step | DL6

- Transforming embeddings through matrix-vector products and tunable weights in deep learning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:05:15 - 00:07:47
Transforming embeddings through matrix-vector products and tunable weights in deep learning. - Attention in transformers, step-by-step | DL6

Transforming embeddings through matrix-vector products and tunable weights in deep learning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:05:15 - 00:07:47
*⚙️ Matrix operations and weighted sum in Attention*
- Explanation of matrix-vector products and tunable weights in matrix operations.
- Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.
- Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:05:37 - 00:21:31
*  - Nouns generate "query" vectors to seek relevant adjectives. - Attention in transformers, step-by-step | DL6

* - Nouns generate "query" vectors to seek relevant adjectives.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:06:14 - 00:07:51
I don't get where the Q and K values come from... is it from the embeddings? It is said in the video that Q is like a question about the adjectives, but where does it come from mathematically? Is it made up? I failed to understand. At that point the question the noun is asking is "are there any adjectives sitting in front of me", while there are none; they're BEHIND it, not in front, so what is it? We are reading from left to right in 2024 still, right? It's in the small details that this falls apart for me. Then it is said that the question is "SOMEHOW" encoded as another vector... yeah, so it just magically popped into existence?

Attention in transformers, step-by-step | DL6
2024年04月07日  @laodrofotic7713 様 
00:06:23 - 00:26:10
Can someone explain to me how those questions are generated, and how the keys respond to them? I'm a bit confused there. Are the questions predefined? And how are the keys created?

Attention in transformers, step-by-step | DL6
2024年04月07日  @curiosityspace8635 様 
00:06:30 - 00:26:10
, the matrix W_Q must have size N_Q x N_E; and so would the matrix W_K. So, each Q_i = W_Q x E_i and K_j = W_K x E_j would have dimension N_Q x - Attention in transformers, step-by-step | DL6

, the matrix W_Q must have size N_Q x N_E; and so would the matrix W_K. So, each Q_i = W_Q x E_i and K_j = W_K x E_j would have dimension N_Q x

Attention in transformers, step-by-step | DL6
2024年04月07日  @kpremaratne 様 
00:06:50 - 00:12:45
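A shape-only sketch that may help with the dimension question raised in this comment, following the column-vector convention used in the video (GPT-3's embedding dimension of 12,288 and query/key dimension of 128; the context length of 8 here is just a toy value): the grid of dot products comes out context-by-context in size, not 128 x 128.

    import numpy as np

    d_embed, d_qk, n_ctx = 12288, 128, 8          # n_ctx is an arbitrary toy context length

    E   = np.random.randn(d_embed, n_ctx)         # one embedding column per token
    W_Q = np.random.randn(d_qk, d_embed)          # query map, 128 x 12288
    W_K = np.random.randn(d_qk, d_embed)          # key map,   128 x 12288

    Q = W_Q @ E                                   # 128 x n_ctx: one query column per token
    K = W_K @ E                                   # 128 x n_ctx: one key column per token

    attention_grid = K.T @ Q                      # n_ctx x n_ctx: one dot product per (key, query) pair
    print(Q.shape, K.shape, attention_grid.shape)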
I love to see that you make column vectors the embeddings! Machine learning people love designating *row* vectors as embeddings/queries/keys/etc. (including the Attention paper), and this makes all the equations flipped from how we expect in math: Q = EW instead of Q = WE, etc. - Attention in transformers, step-by-step | DL6

I love to see that you make column vectors the embeddings! Machine learning people love designating *row* vectors as embeddings/queries/keys/etc. (including the Attention paper), and this makes all the equations flipped from how we expect in math: Q = EW instead of Q = WE, etc.

Attention in transformers, step-by-step | DL6
2024年04月07日  @phlaxyr 様 
00:06:55 - 00:26:10
can the query ask forwards also? like lets say we have "I saw a creature, it was huge and foul, it was eating grass" should on some level produce a similar result to "The huge and foul creature I saw, was eating grass" and the only way they'd seem similar is if the "creature" can query both forwards and backwards. - Attention in transformers, step-by-step | DL6

can the query ask forwards also? like lets say we have "I saw a creature, it was huge and foul, it was eating grass" should on some level produce a similar result to "The huge and foul creature I saw, was eating grass" and the only way they'd seem similar is if the "creature" can query both forwards and backwards.

Attention in transformers, step-by-step | DL6
2024年04月07日  @minecraftermad 様 
00:06:55 - 00:26:10
are positions of left and right tensors of multiplications somehow swapped? also many other places like the row & column of mask matrix - Attention in transformers, step-by-step | DL6

are positions of left and right tensors of multiplications somehow swapped? also many other places like the row & column of mask matrix

Attention in transformers, step-by-step | DL6
2024年04月07日  @standoasis 様 
00:06:56 - 00:12:39
interesting  so an llm could also use following words to fill out the information in the middle of a text? - Attention in transformers, step-by-step | DL6

interesting so an llm could also use following words to fill out the information in the middle of a text?

Attention in transformers, step-by-step | DL6
2024年04月07日  @nutzeeer 様 
00:07:30 - 00:26:10
- Also at , the earlier dimensional size of 128 for the Q, K spaces is only for the individual heads (implicitly 96 heads in this example), whereas later you correctly switch back to 12288 dimensions

Attention in transformers, step-by-step | DL6
2024年04月07日  @broccoli3757 様 
00:07:32 - 00:26:10
So when you hear that developers of AI don't know what AI is doing internally is that referring to how the attention layers are placing the vectors in a tensor? Is there more to it? The media makes it sound mysterious and potentially dangerous, but really it's just the method used to assign a high dimensional coordinate to a token within the context of English language. - Attention in transformers, step-by-step | DL6

So when you hear that developers of AI don't know what AI is doing internally is that referring to how the attention layers are placing the vectors in a tensor? Is there more to it? The media makes it sound mysterious and potentially dangerous, but really it's just the method used to assign a high dimensional coordinate to a token within the context of English language.

Attention in transformers, step-by-step | DL6
2024年04月07日  @j.f.c 様 
00:07:43 - 00:26:10
- Transformers use key matrix to match queries and measure relevance. - Attention in transformers, step-by-step | DL6

- Transformers use key matrix to match queries and measure relevance.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:07:47 - 00:10:31
Transformers use key matrix to match queries and measure relevance. - Attention in transformers, step-by-step | DL6

Transformers use key matrix to match queries and measure relevance.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:07:47 - 00:10:31
*  - "Key" vectors are created for each word and compared with queries using dot products to assess relevance. - Attention in transformers, step-by-step | DL6

* - "Key" vectors are created for each word and compared with queries using dot products to assess relevance.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:07:51 - 00:08:59
*  - The resulting grid of dot products, after softmax normalization, represents the attention pattern, indicating how each word relates to others. - Attention in transformers, step-by-step | DL6

* - The resulting grid of dot products, after softmax normalization, represents the attention pattern, indicating how each word relates to others.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:08:59 - 00:11:09
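
To make the query/key summary above concrete, here is a minimal NumPy sketch of a single attention pattern. The sizes, the random weights, and the rows-as-queries layout are all toy assumptions for illustration (the video draws queries as columns, which only transposes the grid); in a real model W_Q and W_K are learned.

import numpy as np

rng = np.random.default_rng(1)
d_embed, d_key, n_tokens = 12, 4, 5        # toy sizes, far smaller than GPT-3's 12,288 and 128
E   = rng.normal(size=(n_tokens, d_embed)) # one embedding vector per token (rows)
W_Q = rng.normal(size=(d_embed, d_key))    # learned query map (random here)
W_K = rng.normal(size=(d_embed, d_key))    # learned key map (random here)

Q = E @ W_Q                                # each token asks a "question"
K = E @ W_K                                # each token advertises what it offers
scores = Q @ K.T / np.sqrt(d_key)          # dot products measure relevance, scaled for stability

# Softmax each row so the weights one query assigns sum to 1: the attention pattern.
pattern = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(pattern.shape)         # (n_tokens, n_tokens)
print(pattern.sum(axis=-1))  # each row sums to 1
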
Great video as always! Minor quibble at , I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective) so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric, I just haven’t heard it used colloquially that direction (maybe due to the axis that the softmax is applied on?) - Attention in transformers, step-by-step | DL6

Great video as always! Minor quibble at , I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective) so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric, I just haven’t heard it used colloquially that direction (maybe due to the axis that the softmax is applied on?)

Attention in transformers, step-by-step | DL6
2024年04月07日  @seblund 様 
00:09:00 - 00:26:10
Softmax. Eye-opening. - Attention in transformers, step-by-step | DL6

Softmax. Eye-opening.

Attention in transformers, step-by-step | DL6
2024年04月07日  @pavanreddy4611 様 
00:09:54 - 00:26:10
Amazing work; thank you for doing it. Now, am I misunderstanding something, or is there possibly a mistake at  in the "roamed" column? The weight for the word "the" is 0.99 even though it appears _after_ "roamed" in the context. This frightens me, as math can't ever be wrong. - Attention in transformers, step-by-step | DL6

Amazing work; thank you for doing it. Now, am I misunderstanding something, or is there possibly a mistake at in the "roamed" column? The weight for the word "the" is 0.99 even though it appears _after_ "roamed" in the context. This frightens me, as math can't ever be wrong.

Attention in transformers, step-by-step | DL6
2024年04月07日  @chrismontanaro7155 様 
00:10:10 - 00:26:10
Possible error at (): *Q_i* and *K_j* should be _row vectors_ so that *{QKᵀ}_{i,j} = Q_i ⋅ K_j* is their dot product. - Attention in transformers, step-by-step | DL6

Possible error at (): *Q_i* and *K_j* should be _row vectors_ so that *{QKᵀ}_{i,j} = Q_i ⋅ K_j* is their dot product.

Attention in transformers, step-by-step | DL6
2024年04月07日  @muntoonxt 様 
00:10:30 - 00:26:10
Very good video, just a small question:  - If you're treating vectors as column vectors from a math perspective, shouldn't it be Vsoftmax(KᵗQ)? The original paper puts V on the right side and uses softmax(QKᵗ)V because I think it assumes row vectors by default, which makes more sense from a computing perspective due to memory efficiency. - Attention in transformers, step-by-step | DL6

Very good video, just a small question: - If you're treating vectors as column vectors from a math perspective, shouldn't it be Vsoftmax(KᵗQ)? The original paper puts V on the right side and uses softmax(QKᵗ)V because I think it assumes row vectors by default, which makes more sense from a computing perspective due to memory efficiency.

Attention in transformers, step-by-step | DL6
2024年04月07日  @ZhifanSong 様 
00:10:30 - 00:26:10
- Attention mechanism ensures no later words influence earlier words - Attention in transformers, step-by-step | DL6

- Attention mechanism ensures no later words influence earlier words

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:10:31 - 00:12:55
Attention mechanism ensures no later words influence earlier words - Attention in transformers, step-by-step | DL6

Attention mechanism ensures no later words influence earlier words

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:10:31 - 00:12:55
- Masking - Attention in transformers, step-by-step | DL6

- Masking

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:11:08 - 00:12:42
*  - During training, the model predicts the next token for various subsequences, requiring masking to prevent future tokens from influencing past predictions. - Attention in transformers, step-by-step | DL6

* - During training, the model predicts the next token for various subsequences, requiring masking to prevent future tokens from influencing past predictions.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:11:09 - 00:12:16
*  - Masking sets irrelevant attention pattern entries to negative infinity before softmax, resulting in zeros after normalization. - Attention in transformers, step-by-step | DL6

* - Masking sets irrelevant attention pattern entries to negative infinity before softmax, resulting in zeros after normalization.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:12:16 - 00:12:42
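
A hedged sketch of the masking step summarized above, in NumPy with toy numbers: entries where the key position comes after the query position are set to negative infinity before the softmax, so they normalize to exactly zero. Here rows index queries; in layouts where queries index columns, the zeroed triangle flips to the other side of the diagonal.

import numpy as np

def masked_softmax(scores):
    """Causal mask: scores for keys that come after the query are set to -inf,
    so they become exactly 0 after the softmax normalization."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.arange(16, dtype=float).reshape(4, 4)      # toy query-key dot products
print(masked_softmax(scores))   # upper triangle is all zeros; every row still sums to 1
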
(usually set upper right corner to -inf) - Attention in transformers, step-by-step | DL6

(usually set upper right corner to -inf)

Attention in transformers, step-by-step | DL6
2024年04月07日  @standoasis 様 
00:12:39 - 00:26:10
*  - Attention pattern size scales with the square of the context size, making larger contexts computationally expensive. - Attention in transformers, step-by-step | DL6

* - Attention pattern size scales with the square of the context size, making larger contexts computationally expensive.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:12:42 - 00:13:10
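
The quadratic cost noted above is easy to see with a quick back-of-the-envelope script; the byte counts assume float32 scores for a single head in a single layer and ignore the optimizations real systems use. The 2,048-token context matches GPT-3; the larger sizes are hypothetical.

# One attention pattern holds a score for every query-key pair: n_ctx * n_ctx entries.
for n_ctx in (2_048, 8_192, 131_072):
    entries = n_ctx ** 2
    megabytes = entries * 4 / 1e6         # assuming 4-byte float32 scores, one head, one layer
    print(f"{n_ctx:>7} tokens -> {entries:>15,} scores (~{megabytes:,.0f} MB)")
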
- Context size - Attention in transformers, step-by-step | DL6

- Context size

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:12:42 - 00:13:10
Motivation for Masking not entirely clear, will need to rewatch it to better understand - Attention in transformers, step-by-step | DL6

Motivation for Masking not entirely clear, will need to rewatch it to better understand

Attention in transformers, step-by-step | DL6
2024年04月07日  @INGLERAJKAMALRAJENDRA 様 
00:12:43 - 00:26:10
, you say that the size of the Q x K^t matrix is N_C x N_C. Can you please explain this discrepancy? This also leads to another problem: we need to multiply Q x K^t by V. So, what would the size of V be? Thank you very much. - Attention in transformers, step-by-step | DL6

, you say that the size of the Q x K^t matrix is N_C x N_C. Can you please explain this discrepancy? This also leads to another problem: we need to multiply Q x K^t by V. So, what would the size of V be? Thank you very much.

Attention in transformers, step-by-step | DL6
2024年04月07日  @kpremaratne 様 
00:12:45 - 00:26:10
- Attention mechanism variations aim at making context more scalable. - Attention in transformers, step-by-step | DL6

- Attention mechanism variations aim at making context more scalable.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:12:55 - 00:15:26
Attention mechanism variations aim at making context more scalable. - Attention in transformers, step-by-step | DL6

Attention mechanism variations aim at making context more scalable.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:12:55 - 00:15:26
*  - A "value" matrix determines how embeddings should be updated based on relevance. - Attention in transformers, step-by-step | DL6

* - A "value" matrix determines how embeddings should be updated based on relevance.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:13:10 - 00:14:07
- Values - Attention in transformers, step-by-step | DL6

- Values

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:13:10 - 00:15:44
Can you explain to me, when you added the matrix W, what the values in it are? The video only says that you need to multiply by these values, but what are the values initially? - Attention in transformers, step-by-step | DL6

Can you explain to me, when you added the matrix W, what the values in it are? The video only says that you need to multiply by these values, but what are the values initially?

Attention in transformers, step-by-step | DL6
2024年04月07日  @user--------- 様 
00:13:10 - 00:26:10
How does the attention mechanism avoid getting caught in a sort of loop? For example, in the expression "fluffy creature", "fluffy" clearly modifies "creature", i.e. "creature" as in "fluffy creature" as opposed to "spiky creature". However, the specific noun in question also modifies the meaning of the adjective. For example, "fluffy" as in "fluffy creature" is not the same as "fluffy" as in "fluffy argument". In a sense, humans evaluate these things quite atomically. Is there a sort of back-and-forth iteration that exits after a certain point? If so, on what criteria? - Attention in transformers, step-by-step | DL6

How does the attention mechanism avoid getting caught in a sort of loop? For example, in the expression "fluffy creature", "fluffy" clearly modifies "creature", i.e. "creature" as in "fluffy creature" as opposed to "spiky creature". However, the specific noun in question also modifies the meaning of the adjective. For example, "fluffy" as in "fluffy creature" is not the same as "fluffy" as in "fluffy argument". In a sense, humans evaluate these things quite atomically. Is there a sort of back-and-forth iteration that exits after a certain point? If so, on what criteria?

Attention in transformers, step-by-step | DL6
2024年04月07日  @simonr-vp4if 様 
00:13:34 - 00:26:10
*  - Value vectors are added to embeddings based on the attention pattern weights, refining the meaning of words based on context. - Attention in transformers, step-by-step | DL6

* - Value vectors are added to embeddings based on the attention pattern weights, refining the meaning of words based on context.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:14:07 - 00:15:44
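
As a companion to the summary above, here is a minimal NumPy sketch of the value step, with toy sizes and random weights (a real value map is learned, and in GPT-3 it is factored through a 128-dimensional space): each token's change is a weighted sum of value vectors, with the weights taken from that token's row of the attention pattern.

import numpy as np

rng = np.random.default_rng(2)
d_embed, n_tokens = 12, 5
E   = rng.normal(size=(n_tokens, d_embed))         # token embeddings (rows)
W_V = rng.normal(size=(d_embed, d_embed))          # toy value map (full-rank for simplicity)
V   = E @ W_V                                      # one value vector per token

# Stand-in attention pattern: each row holds the weights one token assigns to all tokens.
A = rng.random(size=(n_tokens, n_tokens))
A /= A.sum(axis=-1, keepdims=True)                 # rows sum to 1, like a softmax output

delta_E   = A @ V                                  # weighted sum of value vectors per token
E_refined = E + delta_E                            # residual update: context-adjusted embeddings
print(E_refined.shape)                             # (5, 12)
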
*if* this word is relevant to adjusting the meaning of something else... - Attention in transformers, step-by-step | DL6

*if* this word is relevant to adjusting the meaning of something else...

Attention in transformers, step-by-step | DL6
2024年04月07日  @tornyu 様 
00:14:10 - 00:26:10
@ Shouldn't the main diagonal in the attention pattern matrix (query-key dot product) also be zero, i.e. a word cannot give additional context to update its own embedding? - Attention in transformers, step-by-step | DL6

@ Shouldn't the main diagonal in the attention pattern matrix (query-key dot product) also be zero, i.e. a word cannot give additional context to update its own embedding?

Attention in transformers, step-by-step | DL6
2024年04月07日  @accident_prone 様 
00:14:24 - 00:26:10
for video content from  to 4 is already the output after undergoing the self-attention mechanism. From the matrix, it can also be seen that the attention weights at most diagonal positions are 1 or close to 1. So, why do we still need E4 + ΔE4? - Attention in transformers, step-by-step | DL6

for video content from to 4 is already the output after undergoing the self-attention mechanism. From the matrix, it can also be seen that the attention weights at most diagonal positions are 1 or close to 1. So, why do we still need E4 + ΔE4?

Attention in transformers, step-by-step | DL6
2024年04月07日  @GuadalupeLee-sr8fi 様 
00:15:13 - 00:15:20
? I personally believe that Δ E - Attention in transformers, step-by-step | DL6

? I personally believe that Δ E

Attention in transformers, step-by-step | DL6
2024年04月07日  @GuadalupeLee-sr8fi 様 
00:15:20 - 00:15:32
At  when describing the updating of a given embedding vector with the preceding embeddings selected for by the attention mechanism, I'm not understanding the need for transforming them to value vectors. What does this EiWv=Vi transformation provide that simply taking the attention-discounted sum of the Ei's and updating your embedding directly doesn't? - Attention in transformers, step-by-step | DL6

At when describing the updating of a given embedding vector with the preceding embeddings selected for by the attention mechanism, I'm not understanding the need for transforming them to value vectors. What does this EiWv=Vi transformation provide that simply taking the attention-discounted sum of the Ei's and updating your embedding directly doesn't?

Attention in transformers, step-by-step | DL6
2024年04月07日  @jacobhm7429 様 
00:15:25 - 00:26:10
- Transformers use weighted sums to produce refined embeddings from attention - Attention in transformers, step-by-step | DL6

- Transformers use weighted sums to produce refined embeddings from attention

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:15:26 - 00:17:58
Transformers use weighted sums to produce refined embeddings from attention - Attention in transformers, step-by-step | DL6

Transformers use weighted sums to produce refined embeddings from attention

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:15:26 - 00:17:58
- I think there is an error at , where E5 is shown attending to E6 (value 0.99 shown) which is a forward (future) dependency and should be masked (i.e., set to zero). - Attention in transformers, step-by-step | DL6

- I think there is an error at , where E5 is shown attending to E6 (value 0.99 shown) which is a forward (future) dependency and should be masked (i.e., set to zero).

Attention in transformers, step-by-step | DL6
2024年04月07日  @broccoli3757 様 
00:15:28 - 00:07:32
Is this just a matrix multiplication? How do you go from the value matrices V and the attention scores K^T Q to ΔE? - Attention in transformers, step-by-step | DL6

Is this just a matrix multiplication? How do you go from the value matrices V and the attention scores K^T Q to ΔE?

Attention in transformers, step-by-step | DL6
2024年04月07日  @maruftalukdar1805 様 
00:15:30 - 00:26:10
) For the  content, it seems that ΔE5 should not receive information about V6, as ΔE5 can only receive information about V1-V5 at most. Why is it ΔE5 = 0.9 * V6 in the video? Thank you very much! - Attention in transformers, step-by-step | DL6

) For the content, it seems that ΔE5 should not receive information about V6, as ΔE5 can only receive information about V1-V5 at most. Why is it ΔE5 = 0.9 * V6 in the video? Thank you very much!

Attention in transformers, step-by-step | DL6
2024年04月07日  @GuadalupeLee-sr8fi 様 
00:15:32 - 00:26:10
Just something I didn't fully understand: in   it says the deltas (computed by the attention) are added to the context-free word embeddings to create an in-context embedding. Where is this addition taking place? I did not manage to see where it is located in the "Attention Is All You Need" paper. - Attention in transformers, step-by-step | DL6

Just something I didn't fully understand: in it says the deltas (computed by the attention) are added to the context-free word embeddings to create an in-context embedding. Where is this addition taking place? I did not manage to see where it is located in the "Attention Is All You Need" paper.

Attention in transformers, step-by-step | DL6
2024年04月07日  @itamarhadad1994 様 
00:15:36 - 00:26:10
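
Regarding the question above about where the addition happens: in the original architecture it is the residual ("Add & Norm") connection wrapped around each attention sub-layer, i.e. the sub-layer's output is added back onto its input. A minimal sketch of that wiring, using a dummy stand-in for the attention computation rather than a real head:

import numpy as np

def attention(E):
    """Dummy stand-in for a full attention head: returns a proposed change per token."""
    rng = np.random.default_rng(3)
    return 0.1 * rng.normal(size=E.shape)

def layer_norm(x, eps=1e-5):
    mu    = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

E = np.ones((5, 12))                      # toy embeddings
E_prime = layer_norm(E + attention(E))    # the "Add & Norm" residual step from the paper
print(E_prime.shape)
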
At , I think it is possible to compact the operation into matrix multiplication, then add the columns to the original word vectors. - Attention in transformers, step-by-step | DL6

At , I think it is possible to compact the operation into matrix multiplication, then add the columns to the original word vectors.

Attention in transformers, step-by-step | DL6
2024年04月07日  @brightlin777 様 
00:15:41 - 00:26:10
*  - A single attention head involves key, query, and value matrices, with GPT-3 using a 128-dimensional key/query space and a 12,288-dimensional embedding space. - Attention in transformers, step-by-step | DL6

* - A single attention head involves key, query, and value matrices, with GPT-3 using a 128-dimensional key/query space and a 12,288-dimensional embedding space.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:15:44 - 00:16:45
- Counting parameters - Attention in transformers, step-by-step | DL6

- Counting parameters

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:15:44 - 00:18:21
, the cell E - Attention in transformers, step-by-step | DL6

, the cell E

Attention in transformers, step-by-step | DL6
2024年04月07日  @nanxlu 様 
00:15:45 - 00:26:10
At  and - Attention in transformers, step-by-step | DL6

At and

Attention in transformers, step-by-step | DL6
2024年04月07日  @VincentYCYao 様 
00:15:49 - 00:21:35
*  - Value matrices are factored into "value down" and "value up" matrices to improve efficiency, resulting in approximately 6.3 million parameters per head. - Attention in transformers, step-by-step | DL6

* - Value matrices are factored into "value down" and "value up" matrices to improve efficiency, resulting in approximately 6.3 million parameters per head.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:16:45 - 00:18:22
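
The roughly 6.3 million figure above can be checked with a few lines of arithmetic, using the sizes quoted for GPT-3: 12,288-dimensional embeddings, a 128-dimensional key/query space, and a value map factored into value-down and value-up pieces of the same rank.

d_embed, d_key = 12_288, 128

query_params = d_embed * d_key          # W_Q
key_params   = d_embed * d_key          # W_K
value_down   = d_embed * d_key          # maps embeddings down to 128 dimensions
value_up     = d_key * d_embed          # maps back up to 12,288 dimensions

per_head = query_params + key_params + value_down + value_up
print(f"{per_head:,} parameters per head")   # 6,291,456, i.e. roughly 6.3 million
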
: is it only due to efficiency as you said in , or is there also an intuitive reason that the rank (degrees of freedom) of the value map should not be more than the rank of the query and key matrices? - Attention in transformers, step-by-step | DL6

: is it only due to efficiency as you said in , or is there also an intuitive reason that the rank (degrees of freedom) of the value map should not be more than the rank of the query and key matrices?

Attention in transformers, step-by-step | DL6
2024年04月07日  @rockymandayam9240 様 
00:16:55 - 00:26:10
But it seems there is a minor bug in the video at , where the "value down" matrix is explained - shouldn't the intermediate result vector at this point only be 128 elements, and not 12,288 as shown? The narration does explain we are mapping to a lower-dimensional space. (And correspondingly, the input to the "Value Up" matrix would be this 128-size vector, generating a 12,288-size result.) - Attention in transformers, step-by-step | DL6

But it seems there is a minor bug in the video at , where the "value down" matrix is explained - shouldn't the intermediate result vector at this point only be 128 elements, and not 12,288 as shown? The narration does explain we are mapping to a lower-dimensional space. (And correspondingly, the input to the "Value Up" matrix would be this 128-size vector, generating a 12,288-size result.)

Attention in transformers, step-by-step | DL6
2024年04月07日  @bluestarwars 様 
00:17:24 - 00:26:10
-- Love the 3b1b humblebrag here: essentially "Those paper writers make things confusing, and I am here to lead you with knowledge". Thank you Grant for bringing this to all of us! - Attention in transformers, step-by-step | DL6

-- Love the 3b1b humblebrag here: essentially "Those paper writers make things confusing, and I am here to lead you with knowledge". Thank you Grant for bringing this to all of us!

Attention in transformers, step-by-step | DL6
2024年04月07日  @3Max 様 
00:17:50 - 00:26:10
- Self-attention mechanism explained with parameter count and cross-attention differentiation. - Attention in transformers, step-by-step | DL6

- Self-attention mechanism explained with parameter count and cross-attention differentiation.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:17:58 - 00:20:08
Self-attention mechanism explained with parameter count and cross-attention differentiation. - Attention in transformers, step-by-step | DL6

Self-attention mechanism explained with parameter count and cross-attention differentiation.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:17:58 - 00:20:08
- Doesn't that also mean we're reducing the information in the embedded vectors to the smaller number of dimensions in the key/query space? - Attention in transformers, step-by-step | DL6

- Doesn't that also mean we're reducing the information in the embedded vectors to the smaller number of dimensions in the key/query space?

Attention in transformers, step-by-step | DL6
2024年04月07日  @DracarmenWinterspring 様 
00:18:06 - 00:26:10
- Cross-attention - Attention in transformers, step-by-step | DL6

- Cross-attention

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:18:21 - 00:19:19
*  - Cross-attention is a variation used in models processing different data types (e.g., translation), where keys and queries come from separate datasets. - Attention in transformers, step-by-step | DL6

* - Cross-attention is a variation used in models processing different data types (e.g., translation), where keys and queries come from separate datasets.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:18:22 - 00:19:20
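
A hedged NumPy sketch of the cross-attention variation summarized above, with toy sizes and random weights: queries are computed from one sequence (say, the partially generated translation) while keys and values come from the other (the source sentence), so the attention pattern is rectangular rather than square.

import numpy as np

rng = np.random.default_rng(4)
d_embed, d_key = 12, 4
enc = rng.normal(size=(7, d_embed))   # e.g. embeddings of a source-language sentence
dec = rng.normal(size=(5, d_embed))   # e.g. embeddings of the partial target sentence

W_Q = rng.normal(size=(d_embed, d_key))
W_K = rng.normal(size=(d_embed, d_key))
W_V = rng.normal(size=(d_embed, d_embed))

Q = dec @ W_Q                          # queries come from the output sequence
K = enc @ W_K                          # keys (and values) come from the input sequence
V = enc @ W_V

scores = Q @ K.T / np.sqrt(d_key)      # shape (5, 7): rectangular, and no causal mask here
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
delta_dec = A @ V                      # each target token pulls in source-side context
print(delta_dec.shape)                 # (5, 12)
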
At  not necessarily - cross attention can work between two sequences of the same modality, like T5. It's just that one sequence is seen as the input or information the model should attend to, and the second sequence is the output. - Attention in transformers, step-by-step | DL6

At not necessarily - cross attention can work between two sequences of the same modality, like T5. It's just that one sequence is seen as the input or information the model should attend to, and the second sequence is the output.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HoriaCristescu 様 
00:18:40 - 00:26:10
- Multiple heads - Attention in transformers, step-by-step | DL6

- Multiple heads

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:19:19 - 00:22:16
*  - Multi-headed attention runs multiple attention heads in parallel to capture various contextual relationships. - Attention in transformers, step-by-step | DL6

* - Multi-headed attention runs multiple attention heads in parallel to capture various contextual relationships.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:19:20 - 00:20:52
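
A compact NumPy sketch of the multi-headed idea described above, with toy sizes (3 heads instead of GPT-3's 96, random weights instead of learned ones): every head has its own query, key, and factored value maps, the heads run independently, and their proposed changes are summed into the embedding.

import numpy as np

rng = np.random.default_rng(5)
d_embed, d_key, n_heads, n_tokens = 12, 4, 3, 5   # toy sizes; GPT-3 uses 12,288 / 128 / 96
E = rng.normal(size=(n_tokens, d_embed))

def one_head(E, seed):
    """One attention head with its own query, key, and factored value maps.
    (Causal masking omitted to keep the sketch short.)"""
    r = np.random.default_rng(seed)
    W_Q, W_K = r.normal(size=(d_embed, d_key)), r.normal(size=(d_embed, d_key))
    W_Vdown, W_Vup = r.normal(size=(d_embed, d_key)), r.normal(size=(d_key, d_embed))
    scores = (E @ W_Q) @ (E @ W_K).T / np.sqrt(d_key)
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return A @ (E @ W_Vdown @ W_Vup)               # this head's proposed change per token

# Each head proposes its own change; the proposals are summed and added to the embeddings.
delta = sum(one_head(E, seed) for seed in range(n_heads))
E_refined = E + delta
print(E_refined.shape)
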
GPT-3 Engineers: "So looking at it bro we gotta go ahead and get at least 10,000" - Attention in transformers, step-by-step | DL6

GPT-3 Engineers: "So looking at it bro we gotta go ahead and get at least 10,000"

Attention in transformers, step-by-step | DL6
2024年04月07日  @klingeron5929 様 
00:19:25 - 00:26:10
Thank you for this explanation!  Not to quibble, but "brakes" spelled incorrectly at . - Attention in transformers, step-by-step | DL6

Thank you for this explanation! Not to quibble, but "brakes" spelled incorrectly at .

Attention in transformers, step-by-step | DL6
2024年04月07日  @SteveRowe 様 
00:20:07 - 00:26:10
- Transformers use multi-headed attention to capture different attention patterns - Attention in transformers, step-by-step | DL6

- Transformers use multi-headed attention to capture different attention patterns

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:20:08 - 00:22:34
Transformers use multi-headed attention to capture different attention patterns - Attention in transformers, step-by-step | DL6

Transformers use multi-headed attention to capture different attention patterns

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:20:08 - 00:22:34
In the example at , in the “John hits the breaks sharply” the word “break” means to separate into pieces, whereas “brake” refers to a device used for slowing motion.  Clearly the word “brake” is appropriate.  This in itself presents an interesting problem for the model to address. The context of the inappropriate use of the word “break” must cause the model to effectively “correct” for this error.  Can anyone expand on this concept since the use of language by humans is inherently imperfect.  Very interesting and informative series of videos. - Attention in transformers, step-by-step | DL6

In the example at , in the “John hits the breaks sharply” the word “break” means to separate into pieces, whereas “brake” refers to a device used for slowing motion. Clearly the word “brake” is appropriate. This in itself presents an interesting problem for the model to address. The context of the inappropriate use of the word “break” must cause the model to effectively “correct” for this error. Can anyone expand on this concept since the use of language by humans is inherently imperfect. Very interesting and informative series of videos.

Attention in transformers, step-by-step | DL6
2024年04月07日  @RobertReynolds-b9p 様 
00:20:22 - 00:26:10
*  - GPT-3 uses 96 heads, each with distinct key, query, and value maps, enabling the model to learn diverse ways context affects meaning. - Attention in transformers, step-by-step | DL6

* - GPT-3 uses 96 heads, each with distinct key, query, and value maps, enabling the model to learn diverse ways context affects meaning.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:20:52 - 00:21:36
animation at  🔥 - Attention in transformers, step-by-step | DL6

animation at 🔥

Attention in transformers, step-by-step | DL6
2024年04月07日  @RamithHettiarachchi 様 
00:20:58 - 00:26:10
I have a question at . I have read that if the original embedding has C dimensions, then in a multi-head attention block the output of each head has A dimensions, which is C divided by the number of heads. For example, if C is 48 and we have 3 heads in the attention block, each head's output would have 16 dimensions. But then we cannot possibly add a 48-dimensional vector to a 16-dimensional one. - Attention in transformers, step-by-step | DL6

I have a question at . I have read that if the original embedding has C dimensions, then in a multi-head attention block the output of each head has A dimensions, which is C divided by the number of heads. For example, if C is 48 and we have 3 heads in the attention block, each head's output would have 16 dimensions. But then we cannot possibly add a 48-dimensional vector to a 16-dimensional one.

Attention in transformers, step-by-step | DL6
2024年04月07日  @AryanIITIndore 様 
00:21:20 - 00:26:10
At , why don't we normalize the variations produced and added by the multiple attention blocks by dividing the whole sum by the number of blocks (96 right here)? In the current situation, I have the feeling that we are adding the variation 96 times more than we need to the previous embedding. - Attention in transformers, step-by-step | DL6

At , why don't we normalize the variations produced and added by the multiple attention blocks by dividing the whole sum by the number of blocks (96 right here)? In the current situation, I have the feeling that we are adding the variation 96 times more than we need to the previous embedding.

Attention in transformers, step-by-step | DL6
2024年04月07日  @Turkish_coffee_42 様 
00:21:25 - 00:26:10
At , is there any paper reflecting on how many of these attention heads are redundant? E.g. logging, during training, the percentage of attention heads that actually contribute to the change of the embedding, and possibly dropping some of them. - Attention in transformers, step-by-step | DL6

At , is there any paper reflecting on how many of these attention heads are redundant? E.g. logging, during training, the percentage of attention heads that actually contribute to the change of the embedding, and possibly dropping some of them.

Attention in transformers, step-by-step | DL6
2024年04月07日  @ManuelRavasqueira 様 
00:21:27 - 00:26:10
*🧠 Multi-Headed Attention Mechanism in Transformers* - Explanation of how each attention head has distinct value matrices for producing value vectors. - Introduction to the process of summing proposed changes from different heads to refine embeddings in each position. - Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently. - Attention in transformers, step-by-step | DL6

*🧠 Multi-Headed Attention Mechanism in Transformers* - Explanation of how each attention head has distinct value matrices for producing value vectors. - Introduction to the process of summing proposed changes from different heads to refine embeddings in each position. - Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:21:31 - 00:22:34
, you represented one output of the attention layer as E' = ΔE + E. I am wondering where the ΔE comes from? The matrix multiplication already represents a weighted sum: V' = atten(Q,K,V) = softmax(.)V. That is, each output vector in V' is the weighted sum of all vectors in V. - Attention in transformers, step-by-step | DL6

, you represented one output of the attention layer as E' = ΔE + E. I am wondering where the ΔE comes from? The matrix multiplication already represents a weighted sum: V' = atten(Q,K,V) = softmax(.)V. That is, each output vector in V' is the weighted sum of all vectors in V.

Attention in transformers, step-by-step | DL6
2024年04月07日  @VincentYCYao 様 
00:21:35 - 00:26:10
*  - The proposed changes from each head are summed and added to the original embedding, resulting in a refined embedding. - Attention in transformers, step-by-step | DL6

* - The proposed changes from each head are summed and added to the original embedding, resulting in a refined embedding.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:21:36 - 00:22:16
In , if I am not mistaken, the results from the different heads are concatenated into a higher-dimensional matrix and projected back to the original one, instead of simply being added up together. - Attention in transformers, step-by-step | DL6

In , if I am not mistaken, the results from the different heads are concatenated into a higher-dimensional matrix and projected back to the original one, instead of simply being added up together.

Attention in transformers, step-by-step | DL6
2024年04月07日  @nickname8668 様 
00:21:40 - 00:26:10
I have a question. At , why don't you take the average of all those proposed changes? If you had a lot of attention heads, wouldn't they all together overestimate the change that should be made to the original embedding of a token? Or is this problem automatically fixed by the backpropagation algorithm, so that each change calculated by an attention head is smaller than it would have been if there were only 1 attention head in the attention block? - Attention in transformers, step-by-step | DL6

I have a question. At , why don't you take the average of all those proposed changes? If you had a lot of attention heads, wouldn't they all together overestimate the change that should be made to the original embedding of a token? Or is this problem automatically fixed by the backpropagation algorithm, so that each change calculated by an attention head is smaller than it would have been if there were only 1 attention head in the attention block?

Attention in transformers, step-by-step | DL6
2024年04月07日  @jjksounds5250 様 
00:21:47 - 00:26:10
- The output matrix - Attention in transformers, step-by-step | DL6

- The output matrix

Attention in transformers, step-by-step | DL6
2024年04月07日 
00:22:16 - 00:23:19
*  - In practice, "value up" matrices for all heads are combined into a single "output matrix" for efficiency. - Attention in transformers, step-by-step | DL6

* - In practice, "value up" matrices for all heads are combined into a single "output matrix" for efficiency.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:22:16 - 00:23:19
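
The implementation detail above (and the concatenation questions elsewhere in these comments) comes down to a bookkeeping identity: concatenating the heads' outputs and multiplying by one output matrix gives the same result as letting each head apply its own "value up" matrix and summing, because the output matrix is just those value-up matrices stacked. A small NumPy check with made-up sizes:

import numpy as np

rng = np.random.default_rng(6)
d_embed, d_head, n_heads, n_tokens = 12, 4, 3, 5
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]  # per-head low-dimensional outputs
value_up     = [rng.normal(size=(d_head, d_embed)) for _ in range(n_heads)]   # per-head "value up" maps

# View 1 (as described in the video): each head projects up, and the results are summed.
summed = sum(h @ W for h, W in zip(head_outputs, value_up))

# View 2 (as in most implementations): concatenate head outputs and multiply by one output matrix W_O,
# where W_O is simply the value-up matrices stacked on top of each other.
concat = np.concatenate(head_outputs, axis=-1)       # (n_tokens, n_heads * d_head)
W_O    = np.concatenate(value_up, axis=0)            # (n_heads * d_head, d_embed)
print(np.allclose(summed, concat @ W_O))             # True: the two views are the same computation
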
- The - Attention in transformers, step-by-step | DL6

- The

Attention in transformers, step-by-step | DL6
2024年04月07日  @AISmartEdge 様 
00:22:16 - 00:26:10
*🛠️ Technical Details in Implementing Value Matrices*- Description of the implementation difference in the value matrices as a single output matrix.- Clarification regarding technical nuances in how value matrices are structured in practice.- Noting the distinction between value down and value up matrices commonly seen in papers and implementations. - Attention in transformers, step-by-step | DL6

*🛠️ Technical Details in Implementing Value Matrices*- Description of the implementation difference in the value matrices as a single output matrix.- Clarification regarding technical nuances in how value matrices are structured in practice.- Noting the distinction between value down and value up matrices commonly seen in papers and implementations.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:22:34 - 00:24:03
- Implementation of attention differs in practice - Attention in transformers, step-by-step | DL6

- Implementation of attention differs in practice

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:22:34 - 00:24:53
Implementation of attention differs in practice - Attention in transformers, step-by-step | DL6

Implementation of attention differs in practice

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:22:34 - 00:24:53
Great video, as usual! I'm stuck at the explanation at . The visualization shows that the projection-up matrices are concatenated into the output matrix. The explanation says that the concatenation is then multiplied by the output matrix (itself?). If this is a typo and he means "multiply by the projection-down matrices", how does this work? I remember matrix multiplication only working if the dimensions match, like (n x m) * (m x k), where m has to be the same dimension. Thanks! - Attention in transformers, step-by-step | DL6

Great video, as usual! I'm stuck at the explanation at . The visualization shows that the projection-up matrices are concatenated into the output matrix. The explanation says that the concatenation is then multiplied by the output matrix (itself?). If this is a typo and he means "multiply by the projection-down matrices", how does this work? I remember matrix multiplication only working if the dimensions match, like (n x m) * (m x k), where m has to be the same dimension. Thanks!

Attention in transformers, step-by-step | DL6
2024年04月07日  @legitqs4098 様 
00:23:10 - 00:26:10
- Going deeper - Attention in transformers, step-by-step | DL6

- Going deeper

Attention in transformers, step-by-step | DL6
2024年04月07日 
00:23:19 - 00:24:54
*  - Data flows through multiple attention blocks and other operations, allowing for increasingly nuanced and abstract encoding of information. - Attention in transformers, step-by-step | DL6

* - Data flows through multiple attention blocks and other operations, allowing for increasingly nuanced and abstract encoding of information.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:23:19 - 00:24:16
Overall a very good explanation, just one question: I saw the animation like  many times in chapters 5 and 6; it shows later words updating earlier words. But since you explicitly mentioned the masking in the video and the pinned comment, I am confused. I am leaning towards it being a typo. Likewise, 6E5 should be masked as 0.00, but it shows as 0.99. - Attention in transformers, step-by-step | DL6

Overall a very good explanation, just one question: I saw the animation like many times in chapters 5 and 6; it shows later words updating earlier words. But since you explicitly mentioned the masking in the video and the pinned comment, I am confused. I am leaning towards it being a typo. Likewise, 6E5 should be masked as 0.00, but it shows as 0.99.

Attention in transformers, step-by-step | DL6
2024年04月07日  @nanxlu 様 
00:23:20 - 00:15:45
Thanks for the video! Does anybody know why the glowing attention lines were drawn going both ways (e.g. ), when we chop off the lower part of the attention matrix? Shouldn't this mean that the lines should only go forward (to the right)? - Attention in transformers, step-by-step | DL6

Thanks for the video! Does anybody know why the glowing attention lines were drawn going both ways (e.g. ), when we chop off the lower part of the attention matrix? Shouldn't this mean that the lines should only go forward (to the right)?

Attention in transformers, step-by-step | DL6
2024年04月07日  @StepanKorney 様 
00:23:35 - 00:26:10
*💡 Embedding Nuances and Capacity for Higher-Level Encoding* - Discussion on how embeddings become more nuanced as data flows through multiple transformers and layers. - Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors. - Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model. - Attention in transformers, step-by-step | DL6

*💡 Embedding Nuances and Capacity for Higher-Level Encoding* - Discussion on how embeddings become more nuanced as data flows through multiple transformers and layers. - Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors. - Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.

Attention in transformers, step-by-step | DL6
2024年04月07日  @HarpaAI 様 
00:24:03 - 00:26:10
One question concerning . Does every new vector added to the initial meaning of "one" represent the newly learned, more refined meaning for each attention head or for each attention layer? I think it is each layer, but on the other hand, every attention head seems to learn a different way in which context changes meaning, so it could be both. - Attention in transformers, step-by-step | DL6

One question concerning . Does every new vector added to the initial meaning of "one" represent the newly learned, more refined meaning for each attention head or for each attention layer? I think it is each layer, but on the other hand, every attention head seems to learn a different way in which context changes meaning, so it could be both.

Attention in transformers, step-by-step | DL6
2024年04月07日  @JonnyInChina 様 
00:24:12 - 00:26:10
*  - GPT-3's 96 layers contain about 58 billion parameters devoted to attention heads, representing a significant portion of the total model parameters. - Attention in transformers, step-by-step | DL6

* - GPT-3's 96 layers contain about 58 billion parameters devoted to attention heads, representing a significant portion of the total model parameters.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:24:16 - 00:24:54
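
The roughly 58 billion figure above follows from the per-head count multiplied across heads and layers; a short check using the quoted GPT-3 sizes:

d_embed, d_key, n_heads, n_layers = 12_288, 128, 96, 96

per_head = 4 * d_embed * d_key                      # W_Q, W_K, value-down, value-up
attention_params = per_head * n_heads * n_layers
print(f"{attention_params:,}")                      # 57,982,058,496, i.e. roughly 58 billion
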
- Attention mechanism's success lies in parallelizability for fast computations. - Attention in transformers, step-by-step | DL6

- Attention mechanism's success lies in parallelizability for fast computations.

Attention in transformers, step-by-step | DL6
2024年04月07日  @NithinKandula 様 
00:24:53 - 00:26:10
Attention mechanism's success lies in parallelizability for fast computations. - Attention in transformers, step-by-step | DL6

Attention mechanism's success lies in parallelizability for fast computations.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:24:53 - 00:00:02
Attention mechanism's success lies in parallelizability for fast computations. Crafted by Merlin AI. - Attention in transformers, step-by-step | DL6

Attention mechanism's success lies in parallelizability for fast computations. Crafted by Merlin AI.

Attention in transformers, step-by-step | DL6
2024年04月07日  @user-yl7sv2ec7y 様 
00:24:53 - 00:26:10
- Ending - Attention in transformers, step-by-step | DL6

- Ending

Attention in transformers, step-by-step | DL6
2024年04月07日 
00:24:54 - 00:26:10
*  - The success of attention is partly due to its parallelizability, enabling efficient computation with GPUs. - Attention in transformers, step-by-step | DL6

* - The success of attention is partly due to its parallelizability, enabling efficient computation with GPUs.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:24:54 - 00:25:09
Love this. If you ever edit this again, at , “brakes” is misspelled as “breaks”. - Attention in transformers, step-by-step | DL6

Love this. If you ever edit this again, at , “brakes” is misspelled as “breaks”.

Attention in transformers, step-by-step | DL6
2024年04月07日  @michaellanham2273 様 
00:24:54 - 00:26:10
*  - Parallelizable architectures are advantageous for deep learning as model performance often improves with scale. - Attention in transformers, step-by-step | DL6

* - Parallelizable architectures are advantageous for deep learning as model performance often improves with scale.

Attention in transformers, step-by-step | DL6
2024年04月07日  @wolpumba4099 様 
00:25:09 - 00:26:10