
- Recap on embeddings

* - Transformers are key components of large language models, introduced in the 2017 paper "Attention is All You Need".

*🔍 Understanding the Attention Mechanism in Transformers*
- Introduction to the attention mechanism and its significance in large language models.
- Overview of the goal of transformer models: to predict the next word in a piece of text.
- Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.

- Transformers use attention mechanisms to process and associate tokens with semantic meaning.

* - The model aims to predict the next word in a sequence by processing input text broken down into tokens (often words or parts of words).

* - Each token is associated with a high-dimensional vector called an embedding, where directions in this space correspond to semantic meaning.
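
A minimal sketch of that token-to-embedding step, assuming a toy vocabulary and tiny dimensions (not GPT-3's actual tokenizer or sizes; rows are used instead of the video's columns purely for code convenience): each token's initial embedding is just a lookup into a learned matrix.

```python
import numpy as np

# Toy vocabulary and embedding size (GPT-3 uses ~50k tokens and 12,288 dimensions).
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3, "roamed": 4}
d_embed = 4

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(len(vocab), d_embed))  # learned embedding matrix

tokens = ["a", "fluffy", "blue", "creature"]
token_ids = [vocab[t] for t in tokens]

# Each token's initial, context-free embedding is just a row of W_embed.
E = W_embed[token_ids]
print(E.shape)  # (4, 4): one embedding vector per token
```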

* - Transformers progressively adjust embeddings to encode rich contextual meaning beyond individual words.

- Motivating examples

* - The attention mechanism can be challenging to grasp initially.

* - Examples like "mole" in different contexts highlight the need for context-aware embeddings, as the initial embedding is the same regardless of context.

*🧠 Contextual meaning refinement in Transformers*
- Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.
- Examples showcasing the updating of word embeddings based on context.
- Importance of attention blocks in enriching word embeddings with contextual information.

IMP : @, @

- Attention blocks refine word meanings based on context

* - Attention refines embeddings based on surrounding words; for instance, "tower" becomes more specific when preceded by "Eiffel".

Slightly disappointed you chose not to describe this update as moving the vector to be more "French-wards"

It's actually wrought iron, not steel.

Thank you for the information given at ; it cleared up my doubt from the previous video.

* - A simplified example with the phrase "a fluffy blue creature roamed the verdant forest" demonstrates how adjectives update nouns through attention.

- The attention pattern

* - Each word's initial embedding encodes its meaning and position.

Thank you very much for this very informative series on LLMs. I have a small question regarding the matrix dimensions though. @, we have that N_E = 12,288 is the embedding dimension. @6.40, we have that N_Q = 128 is the query embedding dimension; and so is N_K = 128, the key embedding dimension. So, @, the matrix W_Q must have size N_Q x N_E, and so would the matrix W_K. So, each Q_i = W_Q x E_i and K_j = W_K x E_j would have dimension N_Q x 1. If the context size is N_C (= 2048 in GPT-3, as you indicate), then the matrices Q = [Q_1 ...] and K = [K_1 ...] would each have size N_Q x N_C. Whatever the size of N_C, the size of Q x K^t would then be N_Q x N_Q, i.e., 128 x 128. But @, you say that the size of the Q x K^t matrix is N_C x N_C. Can you please explain this discrepancy? This also leads to another problem: we need to multiply Q x K^t by V. So, what would the size of V be? Thank you very much.

- Transforming embeddings through matrix-vector products and tunable weights in deep learning.

*⚙️ Matrix operations and weighted sum in Attention*
- Explanation of matrix-vector products and tunable weights in matrix operations.
- Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.
- Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.

* - Nouns generate "query" vectors to seek relevant adjectives.

I don't get where the Q and K values come from... are they from the embeddings? It is said in the video that Q is like a question about the adjectives, but where does it come from mathematically? Is it made up? I failed to understand. At , the question the noun is asking is "are there any adjectives sitting in front of me?", while there are none in front of it; they are BEHIND it, not in front, so which is it? We are reading from left to right in 2024 still, right? It's in the small details that this falls apart for me. Then it is said that the question is "SOMEHOW" encoded as another vector... yeah, so it just magically popped into existence?

Can someone explain to me how those questions (queries) are generated, and how keys respond to them? I'm a bit confused there. Are the questions predefined? And how are the keys created?
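
A minimal sketch, with toy dimensions, of where queries and keys come from: each is just a learned linear map (a matrix of tunable weights, random at initialization and adjusted during training) applied to a token's embedding. The "question" is not predefined; the entries of W_Q are learned, so the model itself discovers what is useful to ask.

```python
import numpy as np

rng = np.random.default_rng(1)
d_embed, d_qk = 8, 4  # toy sizes; GPT-3 uses 12,288 and 128

# Tunable weight matrices: random at initialization, tuned by gradient descent.
W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

E_creature = rng.normal(size=(d_embed,))  # one token's context-free embedding

Q = W_Q @ E_creature  # that token's query: "what am I looking for?"
K = W_K @ E_creature  # that token's key: "what do I have to offer?"
print(Q.shape, K.shape)  # (4,) (4,)
```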

I love that you make the embeddings column vectors! Machine learning people love designating *row* vectors as embeddings/queries/keys/etc. (including the Attention paper), and this makes all the equations flipped from how we expect in math: Q = EW instead of Q = WE, etc.

Can the query ask forwards also? Like, let's say we have "I saw a creature, it was huge and foul, it was eating grass"; that should on some level produce a similar result to "The huge and foul creature I saw was eating grass", and the only way they'd seem similar is if "creature" can query both forwards and backwards.

Are the positions of the left and right tensors in the multiplications somehow swapped? The same seems true in many other places, like the rows and columns of the mask matrix.

Interesting, so an LLM could also use the following words to fill out information in the middle of a text?

- Also at , the earlier dimensional size of 128 for the Q, K spaces is only for the case of multiple heads (implicitly 96 heads in this example), whereas later you correctly switch back to 12,288 dimensions.

So when you hear that developers of AI don't know what AI is doing internally, is that referring to how the attention layers are placing the vectors in a tensor? Is there more to it? The media makes it sound mysterious and potentially dangerous, but really it's just the method used to assign a high-dimensional coordinate to a token within the context of the English language.

- Transformers use a key matrix to match queries and measure relevance.

* - "Key" vectors are created for each word and compared with queries using dot products to assess relevance.

* - The resulting grid of dot products, after softmax normalization, represents the attention pattern, indicating how each word relates to others.
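
A minimal sketch of that grid of dot products, using the video's convention (embeddings as columns, one query and key per token, toy dimensions). Dividing by the square root of the key/query dimension before the softmax is the standard scaling and is assumed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_embed, d_qk = 5, 8, 4  # toy sizes

E = rng.normal(size=(d_embed, n_tokens))  # embeddings as columns (video convention)
W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

Q = W_Q @ E  # one query per token, shape (d_qk, n_tokens)
K = W_K @ E  # one key per token,   shape (d_qk, n_tokens)

# Entry (i, j) of the grid: how well token i's key matches token j's query.
scores = (K.T @ Q) / np.sqrt(d_qk)

# Softmax each column, so the weights feeding into one token's update sum to 1.
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
attention_pattern = weights / weights.sum(axis=0, keepdims=True)
print(attention_pattern.sum(axis=0))  # all ones
```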

Great video as always! Minor quibble at , I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective) so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric, I just haven’t heard it used colloquially that direction (maybe due to the axis that the softmax is applied on?)

Amazing work; thank you for doing it. Now, am I misunderstanding something, or is there possibly a mistake at in the "roamed" column? The weight for the word "the" is 0.99 even though it appears _after_ "roamed" in the context. This frightens me, as math can't ever be wrong.

Possible error at (): *Q_i* and *K_j* should be _row vectors_ so that *{QKᵀ}_{i,j} = Q_i ⋅ K_j* is their dot product.

Very good video, just a small question: if you're treating vectors as column vectors from a math perspective, shouldn't it be V softmax(KᵗQ)? The original paper puts V on the right side and uses softmax(QKᵗ)V because, I think, it assumes row vectors by default, which makes more sense from a computing perspective due to memory efficiency.

- Attention mechanism ensures no later words influence earlier words

- Masking

* - During training, the model predicts the next token for various subsequences, requiring masking to prevent future tokens from influencing past predictions.

* - Masking sets irrelevant attention pattern entries to negative infinity before softmax, resulting in zeros after normalization.

(usually the entries in the upper-right triangle are set to -inf)
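
A minimal sketch of that masking step, continuing the column-wise toy setup above. With columns as queries and rows as keys, the masked region is the lower-left triangle; in the more common row-as-query QKᵀ layout it is the upper-right triangle, as the note above says.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens = 5
scores = rng.normal(size=(n_tokens, n_tokens))  # stand-in for K.T @ Q / sqrt(d_qk)

# Rows index keys, columns index queries (video convention): a key from a LATER
# token must not influence an EARLIER token, so entries with key_pos > query_pos
# are set to negative infinity before the softmax.
key_pos = np.arange(n_tokens)[:, None]
query_pos = np.arange(n_tokens)[None, :]
masked = np.where(key_pos > query_pos, -np.inf, scores)

# After the softmax those entries become exactly zero.
weights = np.exp(masked - masked.max(axis=0, keepdims=True))
attention_pattern = weights / weights.sum(axis=0, keepdims=True)
print(np.round(attention_pattern, 2))  # zero below the main diagonal
```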

* - Attention pattern size scales with the square of the context size, making larger contexts computationally expensive.

- Context size

The motivation for masking is not entirely clear; I will need to rewatch to understand it better.

- Attention mechanism variations aim at making context more scalable.

* - A "value" matrix determines how embeddings should be updated based on relevance.

- Values

Can you explain to me: when you added the matrix W, what are the values in it? The video only says that you need to multiply by these values, but what are the values initially?

How does the attention mechanism avoid getting caught in a sort of loop? For example, in the expression "fluffy creature", "fluffy" clearly modifies "creature", i.e. "creature" as in "fluffy creature" as opposed to "spiky creature". However, the specific noun in question also modifies the meaning of the adjective. For example, "fluffy" as in "fluffy creature" is not the same as "fluffy" as in "fluffy argument". In a sense, humans evaluate these things quite atomically. Is there a sort of back-and-forth iteration that exits after a certain point? If so, on what criteria?

* - Value vectors are added to embeddings based on the attention pattern weights, refining the meaning of words based on context.
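
A minimal sketch of that update, assuming toy sizes and a full-rank value map W_V (the down/up factorization comes later in these notes): each column of ΔE is the attention-weighted sum of the value vectors, and it is added onto the original embedding.

```python
import numpy as np

rng = np.random.default_rng(4)
n_tokens, d_embed = 5, 8
E = rng.normal(size=(d_embed, n_tokens))   # current embeddings, one per column
W_V = rng.normal(size=(d_embed, d_embed))  # value map (full-rank toy version)

V = W_V @ E  # one value vector per token

# Stand-in attention pattern: column j says how relevant each token is to token j.
A = rng.random(size=(n_tokens, n_tokens))
A = A / A.sum(axis=0, keepdims=True)

# Each token's change is the attention-weighted sum of the value vectors...
delta_E = V @ A
# ...and the refined embedding is the original plus that change.
E_refined = E + delta_E
print(E_refined.shape)  # (8, 5)
```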

*if* this word is relevant to adjusting the meaning of something else...

@ Shouldn't the main diagonal in the attention pattern matrix (query-key dot product) also be zero, i.e. a word cannot give additional context to update its own embedding?

For the video content from to : E4 is already the output after undergoing the self-attention mechanism. From the matrix, it can also be seen that the attention weights at most diagonal positions are 1 or close to 1. So why do we still need E4 + ΔE4? I personally believe that ΔE

At , when describing the updating of a given embedding vector with the preceding embeddings selected for by the attention mechanism, I'm not understanding the need for transforming them to value vectors. What does this E_i W_v = V_i transformation provide that simply taking the attention-weighted sum of the E_i's and updating your embedding directly doesn't?

- Transformers use weighted sums to produce refined embeddings from attention

- I think there is an error at , where E5 is shown attending to E6 (value 0.99 shown) which is a forward (future) dependency and should be masked (i.e., set to zero).

Is this just a matrix multiplication? How do you go from the value matrix V and the attention scores K^T Q to ΔE?

For the content, it seems that ΔE5 should not receive information about V6, as ΔE5 can only receive information from V1 to V5 at most. Why is ΔE5 = 0.9 * V6 in the video? Thank you very much!

Just something I didn't fully understand: in , it says the deltas (computed by the attention) are added to the context-free word embeddings to create an in-context embedding. Where is this addition taking place? I did not manage to see where it is located in the "Attention Is All You Need" paper.

At , I think it is possible to compact the operation into matrix multiplication, then add the columns to the original word vectors.

* - A single attention head involves key, query, and value matrices, with GPT-3 using a 128-dimensional key/query space and a 12,288-dimensional embedding space.

- Counting parameters

* - Value matrices are factored into "value down" and "value up" matrices to improve efficiency, resulting in approximately 6.3 million parameters per head.
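
A rough sanity check of the ~6.3 million figure, using the dimensions quoted above (approximate bookkeeping, not an exact accounting of GPT-3's weights):

```python
d_embed = 12_288  # GPT-3 embedding dimension
d_qk = 128        # key/query dimension
d_value = 128     # "value down" dimension (chosen to match the key/query space)

query_params = d_qk * d_embed          # W_Q
key_params = d_qk * d_embed            # W_K
value_down_params = d_value * d_embed  # maps 12,288 dims down to 128
value_up_params = d_embed * d_value    # maps 128 dims back up to 12,288

per_head = query_params + key_params + value_down_params + value_up_params
print(f"{per_head:,}")  # 6,291,456, i.e. about 6.3 million parameters per head
```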

Is it only due to efficiency, as you said in , or is there also an intuitive reason that the rank (degrees of freedom) of the value map should not be more than the rank of the query and key matrices?

But it seems there is a minor bug in the video at , where the "value down" matrix is explained: shouldn't the intermediate result vector at this point be only 128 elements, and not 12,288 as shown? The narration does explain that we are mapping to a lower-dimensional space. (Correspondingly, the input to the "value up" matrix would be this 128-size vector, generating a 12,288-size result.)

Love the 3b1b humblebrag here: essentially, "Those paper writers make things confusing, and I am here to lead you with knowledge." Thank you Grant for bringing this to all of us!

- Self-attention mechanism explained with parameter count and cross-attention differentiation.

- Doesn't that also mean we're reducing the information in the embedding vectors to the smaller number of dimensions in the key/query space?

- Cross-attention

* - Cross-attention is a variation used in models processing different data types (e.g., translation), where keys and queries come from separate datasets.

At , not necessarily: cross-attention can work between two sequences of the same modality, like T5. It's just that one sequence is seen as the input, or the information the model should attend to, and the second sequence is the output.
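
A minimal sketch of the difference, reusing the toy machinery above: in cross-attention the queries come from one sequence while the keys (and values) come from another, and typically no causal mask is applied between them.

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, d_qk = 8, 4
n_target, n_source = 3, 6  # e.g. a partial translation vs. the source sentence

E_target = rng.normal(size=(d_embed, n_target))  # sequence being generated
E_source = rng.normal(size=(d_embed, n_source))  # sequence being attended to

W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

Q = W_Q @ E_target  # queries come from one sequence...
K = W_K @ E_source  # ...keys come from the other

scores = (K.T @ Q) / np.sqrt(d_qk)  # shape (n_source, n_target)
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
pattern = weights / weights.sum(axis=0, keepdims=True)
print(pattern.shape)  # each target token attends over all source tokens
```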

- Multiple heads

* - Multi-headed attention runs multiple attention heads in parallel to capture various contextual relationships.

Thank you for this explanation! Not to quibble, but "brakes" spelled incorrectly at .

- Transformers use multi-headed attention to capture different attention patterns

In the example at , "John hits the breaks sharply", the word "break" means to separate into pieces, whereas "brake" refers to a device used for slowing motion. Clearly the word "brake" is appropriate. This in itself presents an interesting problem for the model to address: the context of the inappropriate use of the word "break" must cause the model to effectively "correct" for this error. Can anyone expand on this concept, since the use of language by humans is inherently imperfect? A very interesting and informative series of videos.

* - GPT-3 uses 96 heads, each with distinct key, query, and value maps, enabling the model to learn diverse ways context affects meaning.

I have a question at : I have read that if the original embedding has C dimensions, then in a multi-head attention block the output of each head has A = C / (number of heads) dimensions. For example, if C is 48 and we have 3 heads in the attention block, each head's output would be 16-dimensional. But then we cannot possibly add a 48-dimensional vector to a 16-dimensional one.

At , why don't we normalize the variations produced and added by the multiple attention heads by dividing the whole sum by the number of heads (96 here)? In the current situation, I have the feeling that we are adding 96 times more variation than we need to the previous embedding.

At , is there any paper reflecting on how many of these attention heads are redundant? E.g., logging during training the percentage of attention heads that actually contribute to the change of an embedding, and possibly dropping some of them.

*🧠 Multi-Headed Attention Mechanism in Transformers*
- Explanation of how each attention head has distinct value matrices for producing value vectors.
- Introduction to the process of summing proposed changes from different heads to refine embeddings in each position.
- Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.

You represented one output of the attention layer as E' = ΔE + E. I am wondering where the ΔE comes from. The matrix multiplication already represents a weighted sum: V' = attn(Q, K, V) = softmax(·)V; that is, each output vector in V' is the weighted sum of all the vectors in V.

* - The proposed changes from each head are summed and added to the original embedding, resulting in a refined embedding.
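
A minimal sketch of that summation with toy sizes and only three heads (masking omitted for brevity): each head has its own key, query, and value-down/value-up maps, and its proposed changes are simply accumulated.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tokens, d_embed, d_qk, n_heads = 5, 8, 4, 3  # toy sizes (GPT-3: 12,288 / 128 / 96)
E = rng.normal(size=(d_embed, n_tokens))

delta_total = np.zeros_like(E)
for _ in range(n_heads):
    # Each head has its own learned key, query, and value maps.
    W_Q = rng.normal(size=(d_qk, d_embed))
    W_K = rng.normal(size=(d_qk, d_embed))
    W_V_down = rng.normal(size=(d_qk, d_embed))
    W_V_up = rng.normal(size=(d_embed, d_qk))

    scores = ((W_K @ E).T @ (W_Q @ E)) / np.sqrt(d_qk)
    weights = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = weights / weights.sum(axis=0, keepdims=True)  # (masking omitted for brevity)

    V = W_V_up @ (W_V_down @ E)  # this head's value vectors
    delta_total += V @ A         # accumulate this head's proposed changes

E_refined = E + delta_total      # all proposals added onto the original embeddings
print(E_refined.shape)           # (8, 5)
```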

In , if I am not mistaken, the results from the different heads are concatenated into a higher-dimensional matrix and projected back to the original dimension, instead of simply being added up together.

I have a question. At , why don't you take the average of all those proposed changes? If you had a lot of attention heads, wouldn't they all together overestimate the change that should be made to the original embedding of a token? Or is this problem automatically fixed by the backpropagation algorithm, so that each change calculated by an attention head is smaller than it would have been if there were only one attention head in the attention block?

- The output matrix

* - In practice, "value up" matrices for all heads are combined into a single "output matrix" for efficiency.

*🛠️ Technical Details in Implementing Value Matrices*
- Description of the implementation difference: the per-head value-up matrices are combined into a single output matrix.
- Clarification of technical nuances in how value matrices are structured in practice.
- Note on the distinction between value-down and value-up matrices as commonly seen in papers and implementations.

- Implementation of attention differs in practice

Great video, as usual! I'm stuck at the explanation at . The visualization shows that the projection-up matrices are concatenated into the output matrix, but the explanation says that the concatenation is then multiplied by the output matrix (itself?). If this is a typo and he means "multiplied by the projection-down matrices", how does this work? I remember matrix multiplication only working if the dimensions match, like (n x m) * (m x k), where m has to be the same. Thanks!

- Going deeper

* - Data flows through multiple attention blocks and other operations, allowing for increasingly nuanced and abstract encoding of information.

Overall a very good explanation, just one question: I saw the animation many times in chapters 5 and 6, and it shows later words updating earlier words. But since you explicitly mentioned the masking in the video and in the pinned comment, I am confused. I am leaning towards it being a typo, same as the 0.99 entry between E5 and E6, which should be masked to 0.00.

Thanks for the video! Does anybody know why the glowing attention lines were drawn going both ways (e.g. ), when we chop off the lower part of the attention matrix? Shouldn't this mean that the lines should only go forward (to the right)?

*💡 Embedding Nuances and Capacity for Higher-Level Encoding*
- Discussion of how embeddings become more nuanced as data flows through multiple transformer layers.
- Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors.
- Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.

One question concerning : does every new vector added to the initial meaning of "one" represent the newly learned, more refined meaning for each attention head, or for each attention layer? I think it is each layer, but on the other hand, every attention head seems to learn a different way in which context changes meaning, so it could be both.

* - GPT-3's 96 layers contain about 58 billion parameters devoted to attention heads, representing a significant portion of the total model parameters.
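
A rough back-of-the-envelope check of that figure, reusing the per-head count from above (approximate bookkeeping only):

```python
per_head = 4 * 128 * 12_288   # key + query + value-down + value-up matrices
per_layer = per_head * 96     # 96 heads per attention block
total = per_layer * 96        # 96 layers
print(f"{total:,}")           # 57,982,058,496, i.e. roughly 58 billion
```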

- Attention mechanism's success lies in parallelizability for fast computations.

- Ending

* - The success of attention is partly due to its parallelizability, enabling efficient computation with GPUs.

Love this. If you ever edit this again, at , “brakes” is misspelled as “breaks”.
