Video count: 214

- Where facts in LLMs live

Starts at

Wait, we don't actually know how it works fully?

Hold on, they build the LLM and don't know how the facts are cataloged? This is gonna be a doozy.

- Quick refresher on transformers

My understanding is that an LLM does not in fact store facts, but through the process of predicting word associations through being trained on an absolutely astoundingly large set of examples, it "stores" the likelihood that the word "basketball" is historically the most likely next word in the series. It doesn't have any grasp of a concept of basketball in any sort of meaningful or even static way. This is exactly the problem I'm trying to solve, and honestly I think I found a solution. I just don't know yet how reliable it is on a large scale, or how economic it is in terms of the required computing power. We'll see.

to

What would be simplified is half a circumference with two half circumferences inside it, into infinity... that would show some precision... as to how the machine AI is doing this efficiently... while increasing the accuracy of predictions... the selection line on the graph... sequence of vector lines... attention lines

The tokens (words) convey context information to each other, making each embedding a richer, more nuanced version than the simple meaning of the word. When this animation is shown, the arrows are shown moving from a later token to an earlier token as well. Isn't this contradictory to the concept introduced during masking, where it is said that only earlier words are allowed to enrich the later words? (This is a common animation shown multiple times in this series.)

Only the joker would pick stranger over stronger

Live in a "high dimension"? Please expand.

I was unable to reproduce woman-man ~= aunt-uncle using either OpenAIEmbedding model 'text-embedding-3-small' or the older 'text-embedding-ada-002' model using LangChain. Cosine similarity of 0.29. I tried lots of pairings: aunt-uncle, woman-man, sister-brother, and queen-king. All had cosine similarities in the range 0.29 to 0.38. Happy to share my work if you're curious.
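A minimal sketch of the check being described, for anyone who wants to reproduce it. The embeddings below are random placeholders standing in for whatever model you test (e.g. 'text-embedding-3-small'); note that the analogy claim is about *difference* vectors, not the raw word embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Placeholders: swap in real embeddings fetched from the model you are testing.
e_woman, e_man, e_aunt, e_uncle = (np.random.randn(1536) for _ in range(4))

# Compare the difference vectors (woman - man) and (aunt - uncle).
print("cos(woman - man, aunt - uncle):", cosine(e_woman - e_man, e_aunt - e_uncle))
```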

this subtraction of vector makes me wonder if all of category theory can be described using linear algebra

If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (a vector 'meaning' man, woman, uncle, or aunt, as per the example at ).

Kinda like how neural synapses work...when neurons wire together, they fire together and then it tickles one of the adjacent "dormant" neurons and it lights up with a memory like, "Oh yeah! Totally forgot about that until you just mentioned it again to me...." right?

I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?

- Assumptions for our toy example

This is unironically how I understand "the spectrum" of autism, for example.

Are you implying that the vectors are not normalized, and therefore a dot product of 1 does not mean they are parallel? So what we call semantic similarity is not a measure of pointing in the same direction? So can the dot product be 1 in several directions at the same time?

I don't understand the assumptions made about dot products. Why is a dot product of 1 used to mean that the vector encodes that particular direction/concept? I would have thought that the vector needs to be parallel to that concept vector to assume it encodes that concept. But then a vector would only be able to encode one concept. Is this why dot product = 1 is just sort of conventionally chosen?

How can dot product of a vector with both "Michael" and "Jordan" be 1 when earlier it was said that "Michael" and "Jordan" are nearly orthogonal to each other?

- Inside a multilayer perceptron

Does that sequence of high-dimension vectors (let’s call it a 1D array) in the MLP behave as its own tensor in the LLM?

You are telling me that the AI is Akinator

Who determines "bias" or is it a "vector" with a "code" as well?

Otherwise all borders go through (0, 0).

, "...continuing with the deep learning tradition of overly fancy names..." 😂🤣😂

So this is just an "if, then" function?

The bias exists to move the border between yes and no. It is literally the b in y = mx + b. Without it, all lines y = mx go through (0, 0).

So the weights are simultaneously nudged to form vector encodings for output words as columns, and patterns in rows to get the values of how much each column should be used based on multiplication by input?

As for the question of what the bias does - it's just a control for the height at which you put the threshold of the ReLU. This way you can clip the data at different values depending on the context.
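A tiny numpy sketch of that reading of the bias (illustrative numbers only, not the video's weights): shifting b slides the input value at which the ReLU starts letting anything through.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-2, 2, 9)   # some input values
w = 1.0                     # a single weight, chosen arbitrarily

for b in (-1.0, 0.0, 1.0):  # three different biases
    # The neuron only activates where w*x + b > 0, i.e. where x > -b/w,
    # so the bias moves the ReLU's threshold left or right.
    print(f"b = {b:+.1f}:", relu(w * x + b))
```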

I think this might be a misinterpretation. The MLP block uses the same 50,000 neurons for all tokens in the sequence, not 50,000 neurons per token. @3blue1brown is that correct?

I'm wondering if the phrasing here is a bit misleading. Unless I'm missing something, the block has 50,000 neurons but the sequence of tokens is passed through it, meaning you get the number of activations multiplied by the number of tokens, not that many neurons per se. This part might lead someone to think that those neurons are different for each token, but they are not - only the activations are. Regardless, this is an excellent video.

Are "bias" parts of speech "adjectives" and "adverbs?"

the one piece is real

ITS REAL

- Counting parameters

The parameters for the FF network are counted here. Are these the parameters for the FF network of one token? If so, does this mean that the total number of parameters, including shared parameters, is much higher?

Great video! In case anybody is wondering how to count the parameters of the Llama models, use the same math as in the video, but keep in mind that Llama has a third projection in its MLP, the 'gate projection', of the same size as the up- or down-projections.
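As a rough worked example of that counting, assuming Llama-2 7B's commonly cited sizes (hidden dimension 4096, MLP intermediate dimension 11008, 32 layers); the numbers are an assumption, not taken from the video:

```python
d_model = 4096        # hidden size (assumed, Llama-2 7B)
d_ff    = 11008       # MLP intermediate size (assumed, Llama-2 7B)

up_proj   = d_model * d_ff   # the "up" matrix from the video
down_proj = d_ff * d_model   # the "down" matrix
gate_proj = d_model * d_ff   # Llama's extra gate projection, same shape as up

per_layer_mlp = up_proj + down_proj + gate_proj
print(per_layer_mlp)         # 135,266,304 MLP parameters per layer
print(per_layer_mlp * 32)    # ~4.3 billion across 32 layers, before attention etc.
```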

- Superposition

the superposition chapter is great... Watch it guys n girls

Re: Superposition - dimensions not being completely independent but rather related. Here's a way to understand superposition that IMO was not really clear in the video.

Where could someone find the source material or "footnotes/bibliography" for each LLM's main base of facts and standardized information deemed "valid" by independent, accredited international sources or bodies of information?

Is there a way to estimate the amount of additional "dimensions" you get by having 89-91 degrees versus 90 degrees?

this part is really cool!

...such that all the vectors are orthogonal is illuminating. It suggests the surface area of the n-dimensional sphere being partitioned into a vast quantity of locally flat 'Gaussians' (central limit ;-) of similarity directions. Once you have that, plus the layer depth to discriminate conceptual level, one gets to see how it works, though it doesn't have any explanatory capability, because its vocabulary (numeric vectors) does not bake in the human explanatory phrasings we use (all very 'physician heal thyself' given it's an LLM!)

can't i make it in JavaScript? ^^

Important correction: There's an error in the scrappy code I was demoing around that point, such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, it appears not to be possible to get 100k vectors in 100d to be as "nearly orthogonal" as this. 100 dimensions seems to be too low, at least for the (89°, 91°) range, for the Johnson-Lindenstrauss lemma to really kick in.
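A quick sketch of the underlying phenomenon (not the optimization from the video, just random sampling, and just for intuition): independently drawn unit vectors in high dimensions already tend to be nearly orthogonal, and the concentration around 90° tightens as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (10, 100, 1000):
    # Sample a batch of random unit vectors in `dim` dimensions.
    vecs = rng.standard_normal((500, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    # Pairwise dot products, excluding each vector with itself.
    dots = vecs @ vecs.T
    off_diag = dots[~np.eye(len(vecs), dtype=bool)]

    angles = np.degrees(np.arccos(np.clip(off_diag, -1, 1)))
    within = np.mean((angles > 89) & (angles < 91))
    print(f"{dim:5d} dims: {100*within:5.1f}% of pairs fall within (89°, 91°)")
```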

It's a bit unclear how adding noise to the vectors' perpendicularity can create space for additional features... can somebody help me understand that?

Another way to imagine it: shooting an arrow in space, and shooting a second arrow in a 0.001° different direction. The first inch is nothing, nor is the first 20. But as they go feet and miles out, they'll eventually be so far apart that it's hard to believe they came from the same bow. Also chaotic pendulums, such as a pendulum on the end of a pendulum: slight changes end up with completely different movement.

More and more it feels as if current networks are mainly our first bookshelves.

times as many independent ideas." 💥

- Bell Laboratories

I am currently interning at Bell Laboratories :) A fun fact: Yann LeCun created CNNs as part of his internship at Bell Labs

This reminds me of bloom filters

hey ya'll! 🇺🇿

it seems obvious to me that a superposition would store more data, not because of nearly perpendicular vectors, but because you're effectively moving from a unary system to a higher base. Same reason you can count to 10, or to 1023, on the same amount of fingers

Skip to there to understand the issue... Training GPT-4 in 2022 should have taken around a cool thousand years. Then Huang says something silly: he says "Well, they used a stack of 8,000 H100 GPUs and it only took three months" - forgetting that the H100 was only on the drawing board back in 2022 when GPT-4 was trained. Now read a little about the latest discoveries in brain science, and I mean especially focus on N400 and P600. And you tell me how to explain Dan, Rob, Max and Dennis. I'm gonna leave this up to you, as I'm sure you understand what I'm getting at.

- Up next

Also important in the training process is the concept of self-supervised learning to harness the mass of unlabelled data (books, in NLP).

Holograms coming! :D

Uhhh, holograms, I'm so excited, and that is on top of an excellent video. I'm amazed how you manage to consistently keep such a high standard :D

I may have learnt something new here. Grant is saying that the skip connection in the MLP is actually enabling the transformation of the original vector into another vector with enriched contextual meaning. Specifically, he is saying that via the summation of the skip connection, the MLP has somehow learnt the directional vector to be added onto the original vector "Michael Jordan" to produce a new output vector that adds "basketball" information. I was originally of the impression that skip connections are only there to combat vanishing gradients and expedite learning. But now Grant is emphasizing that they do much more!

- End of Harriet Nembhard's introduction

- The cliché

Putting a face with a voice: first, it is not what I imagined when listening to/watching your videos! Second, from here on out when watching, I will always see 3B1B narrating his videos in a cap and gown!

Following your dreams requires more than just passion.

I love how, after Grant points out the 'nerdiness' of the audience and continues to say (jokingly) "... in the vector space of all possible advice", the camera centers on two lovely nerds sitting, without batting an eyelid, thinking: "yeees? ...".

- The shifting goal

Following your dreams requires pragmatic concerns beyond inspiration.

A guy in the audience sighs after hearing that the goal of their life changes today. You can see the stress growing on the faces of the audience.

Transition from personal growth to adding value to others

- Action precedes motivation

wow. Well that hits hard. I hope I'll remember it.

- Timing

Action precedes motivation in finding a career you love

Survivorship bias affects the advice of pursuing high-risk, high-reward paths.

- Know your influence

"Cut Defence Ties". Glad to see the Student Body Politic has some positive values... 😀 And a cool idea, btw, I've not seen that in the UK, a slogan top of the mortar-board. But speaking with a tassle??? I'm not sure they do that here, either... 🤔

Success is a function of the value you bring to others

lol my calculus teacher told me I should consider a double major or at least a minor in math. I took his advice as a compliment, and completely ignored it. Now I am unemployed training to become a data scientist! Trust those who have more experience than you!

that's such an amazing mentality.

- Anticipate change

"CUT DEFENSE TIES" in the audience hell yeah

Influence the dreams of others and be adaptable to change

This is the best commencement speech I've ever heard. Also, the audience reaction at cracked me up.

Like this so that I can listen to this part again

"following not the dreams but the opportunities"

The dude who greets you and makes you comfortable at a party full of strangers

Love the dude at showing his enthusiasm for the class of 2024

- Recap on embeddings

* - Transformers are key components of large language models, introduced in the 2017 paper "Attention is All You Need".

*🔍 Understanding the Attention Mechanism in Transformers*- Introduction to the attention mechanism and its significance in large language models.- Overview of the goal of transformer models to predict the next word in a piece of text.- Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.

- Transformers use attention mechanisms to process and associate tokens with semantic meaning.

* - The model aims to predict the next word in a sequence by processing input text broken down into tokens (often words or parts of words).

* - Each token is associated with a high-dimensional vector called an embedding, where directions in this space correspond to semantic meaning.

It is late in Sydney right now and I'm up watching your video from my bed. I should probably get some sleep, I have morning classes; it's just that your content is too God damned interesting. Plus, I'm a teenager. I can't be separated from my phone except by a 16th-century French-style beheading. POST MORE VIDEOS! If I can't sleep you shouldn't get the luxury!

* - Transformers progressively adjust embeddings to encode rich contextual meaning beyond individual words.

- Motivating examples

* - The attention mechanism can be challenging to grasp initially.

* - Examples like "mole" in different contexts highlight the need for context-aware embeddings, as the initial embedding is the same regardless of context.

*🧠 Contextual meaning refinement in Transformers*- Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.- Examples showcasing the updating of word embeddings based on context.- Importance of attention blocks in enriching word embeddings with contextual information.

- Attention blocks refine word meanings based on context

* - Attention refines embeddings based on surrounding words; for instance, "tower" becomes more specific when preceded by "Eiffel".

Slightly disappointed you chose not to describe this update as moving the vector to be more "French-wards"

It's actually wrought iron, not steel.

Thank you for the information given at , it cleared my doubt from the previous video

* - A simplified example with the phrase "a fluffy blue creature roamed the verdant forest" demonstrates how adjectives update nouns through attention.

- The attention pattern

* - Each word's initial embedding encodes its meaning and position.

Thank you very much for this very informative series on LLMs. I have a small question regarding the matrix dimensions though. We have that N_E = 12,288 is the embedding dimension, and (@6:40) that N_Q = 128 is the query embedding dimension; so is N_K = 128, the key embedding dimension. If the context size is N_C (= 2048 in GPT-3, as you indicate), then the matrices Q = [Q_1 ...] and K = [K_1 ...] would each have size N_Q x N_C. Whatever the size of N_C, the size of Q x K^t would be N_Q x N_Q, i.e., 128 x 128. But...

- Transforming embeddings through matrix-vector products and tunable weights in deep learning.

*⚙️ Matrix operations and weighted sum in Attention*- Explanation of matrix-vector products and tunable weights in matrix operations.- Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.- Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.

* - Nouns generate "query" vectors to seek relevant adjectives.

I don't get where the Q and K values come from... Are they from the embeddings? It is said in the video that Q is like a question about the adjectives, but where does it come from mathematically? Is it made up? I failed to understand. The question the noun is asking is "are there any adjectives sitting in front of me", while there are none in front - they're BEHIND it, not in front, so what is it? We are reading from left to right in 2024 still, right? It's in the small details that this falls apart for me. Then it is said that the question is "SOMEHOW" encoded as another vector... yeah, so it just magically popped into existence?

Can someone explain to me how those questions are generated, and how the keys respond to them? I'm a bit confused there. Are the questions predefined? And how are the keys created?

The matrix W_Q must have size N_Q x N_E; and so would the matrix W_K. So, each Q_i = W_Q x E_i and K_j = W_K x E_j would have dimension N_Q x 1.

I love to see that you make column vectors the embeddings! Machine learning people love designating *row* vectors as embeddings/queries/keys/etc. (including the Attention paper), and this makes all the equations flipped from how we expect in math: Q = EW instead of Q = WE, etc.
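A tiny numpy check of the convention difference being described (arbitrary toy sizes, random data): treating embeddings as columns and writing Q = W_Q E gives exactly the transpose of the row-vector convention Q = E W_Q^T used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_e, n_q, n_tokens = 6, 3, 4                     # toy sizes, not GPT-3's

E_cols = rng.standard_normal((n_e, n_tokens))    # embeddings as columns (video's convention)
W_q    = rng.standard_normal((n_q, n_e))         # query map: d_model -> d_query

Q_cols = W_q @ E_cols                            # column convention: Q = W_Q E
Q_rows = E_cols.T @ W_q.T                        # row convention, as in the paper

print(np.allclose(Q_cols, Q_rows.T))             # True: the two are transposes of each other
```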

Can the query ask forwards also? Like, let's say we have "I saw a creature, it was huge and foul, it was eating grass"; that should on some level produce a similar result to "The huge and foul creature I saw was eating grass", and the only way they'd seem similar is if "creature" can query both forwards and backwards.

Are the positions of the left and right tensors of the multiplications somehow swapped? Also in many other places, like the rows and columns of the mask matrix.

Interesting, so an LLM could also use the following words to fill in information in the middle of a text?

- Also, the earlier dimensional size of 128 for the Q, K spaces is only for the individual heads (implicitly 96 heads in this example), whereas later you correctly switch back to 12,288 dimensions

So when you hear that developers of AI don't know what AI is doing internally is that referring to how the attention layers are placing the vectors in a tensor? Is there more to it? The media makes it sound mysterious and potentially dangerous, but really it's just the method used to assign a high dimensional coordinate to a token within the context of English language.

- Transformers use key matrix to match queries and measure relevance.

* - "Key" vectors are created for each word and compared with queries using dot products to assess relevance.

* - The resulting grid of dot products, after softmax normalization, represents the attention pattern, indicating how each word relates to others.
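A minimal numpy sketch of that grid, with toy sizes and random weights standing in for trained ones, following the video's column-vector convention (softmax applied down each column):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_qk, n_tokens = 8, 4, 5               # toy sizes instead of 12,288 / 128 / 2048

E  = rng.standard_normal((d_model, n_tokens))   # one embedding per column
Wq = rng.standard_normal((d_qk, d_model))       # query map
Wk = rng.standard_normal((d_qk, d_model))       # key map

Q = Wq @ E                                      # queries, one per column
K = Wk @ E                                      # keys, one per column

scores = (K.T @ Q) / np.sqrt(d_qk)              # dot product of every key with every query

# Softmax down each column turns the scores into weights that sum to 1,
# i.e. how relevant each token is to the token owning that column.
pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
print(pattern.round(2))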

Great video as always! Minor quibble at , I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective) so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric, I just haven’t heard it used colloquially that direction (maybe due to the axis that the softmax is applied on?)

Softmax. Eye-opening.

Amazing work; thank you for doing it. Now, am I misunderstanding something, or is there possibly a mistake at in the "roamed" column? The weight for the word "the" is 0.99 even though it appears _after_ "roamed" in the context. This frightens me, as math can't ever be wrong.

Possible error at (): *Q_i* and *K_j* should be _row vectors_ so that *{QKᵀ}_{i,j} = Q_i ⋅ K_j* is their dot product.

Very good video, just a small question: - If you’re treating vectors as column vectors from a math perspective, shouldn’t it be Vsoftmax(KᵗQ)?? The original paper puts V on right side and uses softmax(QKᵗ)V because i think it assumes row vectors by default which makes more sense from a computing perspective due to memory efficiency.

- Attention mechanism ensures no later words influence earlier words

- Masking

* - During training, the model predicts the next token for various subsequences, requiring masking to prevent future tokens from influencing past predictions.

* - Masking sets irrelevant attention pattern entries to negative infinity before softmax, resulting in zeros after normalization.
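A sketch of that masking step on a toy score matrix, continuing the column convention above (keys as rows, queries as columns), so the masked region is the lower-left triangle; in the more common row convention of the paper, the same mask sits in the upper-right corner, as the comment below notes.

```python
import numpy as np

n_tokens = 4
rng = np.random.default_rng(3)
scores = rng.standard_normal((n_tokens, n_tokens))   # rows = keys, columns = queries

# Mask out "key comes after query": set those scores to -inf before the softmax.
key_idx   = np.arange(n_tokens)[:, None]
query_idx = np.arange(n_tokens)[None, :]
scores[key_idx > query_idx] = -np.inf

pattern = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
print(pattern.round(2))   # masked entries come out as exact zeros after the softmax
```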

(usually set upper right corner to -inf)

* - Attention pattern size scales with the square of the context size, making larger contexts computationally expensive.

- Context size

The motivation for masking is not entirely clear; I will need to rewatch it to understand it better.

You say that the size of the Q x K^t matrix is N_C x N_C. Can you please explain this discrepancy? This also leads to another problem: we need to multiply Q x K^t by V. So, what would the size of V be? Thank you very much.

- Attention mechanism variations aim at making context more scalable.

* - A "value" matrix determines how embeddings should be updated based on relevance.

- Values

Can you explain to me: when you added the matrix W, what are the values in it? The video only says that you need to multiply by these values, but what are the values initially?

How does the attention mechanism avoid getting caught in a sort of loop? For example, in the expression "fluffy creature", "fluffy" clearly modifies "creature", i.e. "creature" as in "fluffy creature" as opposed to "spiky creature". However, the specific noun in question also modifies the meaning of the adjective. For example, "fluffy" as in "fluffy creature" is not the same as "fluffy" as in "fluffy argument". In a sense, humans evaluate these things quite atomically. Is there a sort of back-and-forth iteration that exits after a certain point? If so, on what criteria?

* - Value vectors are added to embeddings based on the attention pattern weights, refining the meaning of words based on context.
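Continuing the same toy sketch: each column of the attention pattern weights a sum of value vectors, and that weighted sum is the ΔE added onto the original embedding. For simplicity this uses a single full-rank value matrix rather than the value-down/value-up factoring discussed later, and a placeholder pattern stands in for the one computed above.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, n_tokens = 8, 5
E  = rng.standard_normal((d_model, n_tokens))           # embeddings as columns
Wv = rng.standard_normal((d_model, d_model))            # value map (unfactored, for simplicity)

# Placeholder attention pattern; in practice this comes from the masked softmax above,
# an (n_tokens, n_tokens) matrix whose columns each sum to 1.
pattern = np.full((n_tokens, n_tokens), 1.0 / n_tokens)

V       = Wv @ E              # one value vector per column
delta_E = V @ pattern         # column i = weighted sum of value vectors for token i
E_new   = E + delta_E         # the refined, context-aware embeddings
print(E_new.shape)            # (d_model, n_tokens)
```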

*if* this word is relevant to adjusting the meaning of something else...

@ Shouldn't the main diagonal in the attention pattern matrix (query-key dot product) also be zero, i.e. a word cannot give additional context to update its own embedding?

For the video content, E4 is already the output after undergoing the self-attention mechanism. From the matrix, it can also be seen that the attention weights at most diagonal positions are 1 or close to 1. So why do we still need E4 + ΔE4?

? I personally believe that Δ E

When describing the updating of a given embedding vector with the preceding embeddings selected by the attention mechanism, I'm not understanding the need for transforming them into value vectors. What does this E_i W_v = V_i transformation provide that simply taking the attention-weighted sum of the E_i's and updating your embedding directly doesn't?

- Transformers use weighted sums to produce refined embeddings from attention

- I think there is an error at , where E5 is shown attending to E6 (value 0.99 shown) which is a forward (future) dependency and should be masked (i.e., set to zero).

Is this just a matrix multiplication? How do you go from the value matrices V and the attention scores K^T Q to delta E?

For the content here, it seems that ΔE5 should not receive information about V6, as ΔE5 can only receive information about V1-V5 at most. Why is it ΔE5 = 0.9 * V6 in the video? Thank you very much!

Just something I didn't fully understand: it says the deltas (computed by the attention) are added to the context-free word embeddings to create an in-context embedding. Where is this addition taking place? I did not manage to see where it is located in the "Attention Is All You Need" paper.

At , I think it is possible to compact the operation into matrix multiplication, then add the columns to the original word vectors.

* - A single attention head involves key, query, and value matrices, with GPT-3 using a 128-dimensional key/query space and a 12,288-dimensional embedding space.

- Counting parameters

* - Value matrices are factored into "value down" and "value up" matrices to improve efficiency, resulting in approximately 6.3 million parameters per head.
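The arithmetic behind that ≈6.3 million figure, worked out with GPT-3's stated sizes:

```python
d_model = 12_288   # embedding dimension
d_qk    = 128      # key/query (and value-down) dimension

query      = d_qk * d_model       # W_Q
key        = d_qk * d_model       # W_K
value_down = d_qk * d_model       # projects 12,288 -> 128
value_up   = d_model * d_qk       # projects 128 -> 12,288

per_head = query + key + value_down + value_up
print(per_head)   # 6,291,456 ≈ 6.3 million parameters per head
```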

Is it only due to efficiency, as you said, or is there also an intuitive reason that the rank (degrees of freedom) of the value map should not be more than the rank of the query and key matrices?

But it seems there is a minor bug in the video where the "value down" matrix is explained - shouldn't the intermediate result vector at this point be only 128 elements, and not 12,288 as shown? The narration does explain we are mapping to a lower-dimensional space (and correspondingly, the input to the "value up" matrix would be this 128-element vector, generating a 12,288-element result).

-- Love the 3b1b humblebrag here. essentially "Those paper writers make things confusing, and I am here to lead you with knowledge". Thank you Grant for bringing this to all of us!

- Self-attention mechanism explained with parameter count and cross-attention differentiation.

- doesn't that also mean we're reducing the information in the embedded vectors to the smaller amount of dimensions in the key/query space?

- Cross-attention

* - Cross-attention is a variation used in models processing different data types (e.g., translation), where keys and queries come from separate datasets.

At not necessarily - cross attention can work between two sequences of the same modality, like T5. It's just that one sequence is seen as the input or information the model should attend to, and the second sequence is the output.

- Multiple heads

* - Multi-headed attention runs multiple attention heads in parallel to capture various contextual relationships.

GPT-3 Engineers: "So looking at it bro we gotta go ahead and get at least 10,000"

Thank you for this explanation! Not to quibble, but "brakes" spelled incorrectly at .

- Transformers use multi-headed attention to capture different attention patterns

In the example at , in the “John hits the breaks sharply” the word “break” means to separate into pieces, whereas “brake” refers to a device used for slowing motion. Clearly the word “brake” is appropriate. This in itself presents an interesting problem for the model to address. The context of the inappropriate use of the word “break” must cause the model to effectively “correct” for this error. Can anyone expand on this concept since the use of language by humans is inherently imperfect. Very interesting and informative series of videos.

* - GPT-3 uses 96 heads, each with distinct key, query, and value maps, enabling the model to learn diverse ways context affects meaning.

animation at 🔥

I have a question: I have read something in which, let's say, the original embedding has dimension C, and in the multi-head attention block the output of each head has dimension A, which is C divided by the number of heads. For example, if C is 48 and we have 3 heads in the attention block, each head's output would have dimension 16. Now we cannot possibly add a 48-dimensional vector to a 16-dimensional one.

Why don't we normalize the variations produced and added by the multiple attention heads by dividing the whole sum by the number of heads (96 right here)? In the current situation, I have the feeling that we are adding the variation 96 times more than we need to to the previous embedding.

Is there any paper reflecting on how many of these attention heads are redundant? E.g., logging during training the percentage of attention heads that actually contribute to the change of the embedding, and possibly dropping some of them.

*🧠 Multi-Headed Attention Mechanism in Transformers*- Explanation of how each attention head has distinct value matrices for producing value vectors.- Introduction to the process of summing proposed changes from different heads to refine embeddings in each position.- Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.

You represented one output of the attention layer as E' = deltaE + E. I am wondering where the deltaE comes from. The matrix multiplication already represents a weighted sum: V' = atten(Q,K,V) = softmax(.)V. That is, each output vector in V' is the weighted sum of all vectors in V.

* - The proposed changes from each head are summed and added to the original embedding, resulting in a refined embedding.

If I am not mistaken, the results from the different heads are concatenated into a higher-dimensional matrix and projected back to the original one, instead of simply being added together.

I have a question: why don't you take the average of all those proposed changes? If you had a lot of attention heads, wouldn't they all together overestimate the change that should be made to the original embedding of a token? Or is this problem automatically fixed by the backpropagation algorithm, so that each change calculated by an attention head is smaller than it would have been if there were only one attention head in the attention block?

- The output matrix

* - In practice, "value up" matrices for all heads are combined into a single "output matrix" for efficiency.
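A sketch of why those two descriptions agree (toy sizes, random data): summing each head's value-up matrix applied to its low-dimensional output is the same as concatenating the heads' outputs and multiplying once by the combined output matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head, n_heads, n_tokens = 8, 2, 3, 5   # toy sizes

# Each head produces a small (d_head x n_tokens) result and has its own value-up matrix.
head_outputs = [rng.standard_normal((d_head, n_tokens)) for _ in range(n_heads)]
value_up     = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]

# Description 1: apply each head's value-up matrix, then sum the proposed changes.
summed = sum(Wu @ h for Wu, h in zip(value_up, head_outputs))

# Description 2: concatenate head outputs and multiply by one big output matrix,
# whose blocks are the value-up matrices stacked side by side.
W_output = np.hstack(value_up)                  # (d_model, n_heads * d_head)
concat   = np.vstack(head_outputs)              # (n_heads * d_head, n_tokens)
combined = W_output @ concat

print(np.allclose(summed, combined))            # True
```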

*🛠️ Technical Details in Implementing Value Matrices*- Description of the implementation difference in the value matrices as a single output matrix.- Clarification regarding technical nuances in how value matrices are structured in practice.- Noting the distinction between value down and value up matrices commonly seen in papers and implementations.

- Implementation of attention differs in practice

Great video, as usual! I'm stuck at the explanation here. The visualization shows that the projection-up matrices are concatenated into the output matrix. The explanation says that the concatenation is then multiplied by the output matrix (itself?). If this is a typo and he means "multiplied by the projection-down matrices", how does this work? I remember matrix multiplication only working if the dimensions match, like (n x m) * (m x k), where m has to be the same dimension. Thanks!

- Going deeper

* - Data flows through multiple attention blocks and other operations, allowing for increasingly nuanced and abstract encoding of information.

Overall a very good explanation, just one question: I saw this animation many times in chapters 5 and 6, and it shows later words updating earlier words. But since you explicitly mentioned the masking in the video and the pinned comment, I am confused. I am leaning towards it being a typo. Same as the E6-to-E5 entry, which should be masked to 0.00 but shows as 0.99.

Thanks for the video! Does anybody know why the glowing attention lines were drawn going both ways (e.g., here), when we chop off the lower part of the attention matrix? Shouldn't this mean that the lines should only go forward (to the right)?

*💡 Embedding Nuances and Capacity for Higher-Level Encoding*- Discussion on how embeddings become more nuanced as data flows through multiple transformers and layers.- Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors.- Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.

One question concerning this: does every new vector added to the initial meaning of "one" represent the new, more refined meaning learned by each attention head, or by each attention layer? I think it is each layer, but on the other hand, every attention head seems to learn a different way that context changes meaning, so it could be both...

* - GPT-3's 96 layers contain about 58 billion parameters devoted to attention heads, representing a significant portion of the total model parameters.
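The rough arithmetic behind that figure, using the per-head count from above:

```python
per_head        = 6_291_456   # ≈ 6.3 M parameters per attention head (from above)
heads_per_layer = 96
layers          = 96

total_attention = per_head * heads_per_layer * layers
print(f"{total_attention:,}")  # 57,982,058,496 ≈ 58 billion parameters
```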

- Attention mechanism's success lies in parallelizability for fast computations.

- Ending

* - The success of attention is partly due to its parallelizability, enabling efficient computation with GPUs.

Love this. If you ever edit this again, at , “brakes” is misspelled as “breaks”.
