
- Recap on embeddings

* - Transformers are key components of large language models, introduced in the 2017 paper "Attention is All You Need".

*🔍 Understanding the Attention Mechanism in Transformers*
- Introduction to the attention mechanism and its significance in large language models.
- Overview of the goal of transformer models: to predict the next word in a piece of text.
- Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.

- Transformers use attention mechanisms to process and associate tokens with semantic meaning.

* - The model aims to predict the next word in a sequence by processing input text broken down into tokens (often words or parts of words).

* - Each token is associated with a high-dimensional vector called an embedding, where directions in this space correspond to semantic meaning.
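
A minimal sketch of that token-to-embedding step, assuming a toy vocabulary and tiny dimensions (not GPT-3's actual tokenizer or sizes; rows are used instead of the video's columns purely for code convenience): each token's initial embedding is just a lookup into a learned matrix.

```python
import numpy as np

# Toy vocabulary and embedding size (GPT-3 uses ~50k tokens and 12,288 dimensions).
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3, "roamed": 4}
d_embed = 4

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(len(vocab), d_embed))  # learned embedding matrix

tokens = ["a", "fluffy", "blue", "creature"]
token_ids = [vocab[t] for t in tokens]

# Each token's initial, context-free embedding is just a row of W_embed.
E = W_embed[token_ids]
print(E.shape)  # (4, 4): one embedding vector per token
```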

* - Transformers progressively adjust embeddings to encode rich contextual meaning beyond individual words.

- Motivating examples

* - The attention mechanism can be challenging to grasp initially.

* - Examples like "mole" in different contexts highlight the need for context-aware embeddings, as the initial embedding is the same regardless of context.

*🧠 Contextual meaning refinement in Transformers*
- Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.
- Examples showcasing the updating of word embeddings based on context.
- Importance of attention blocks in enriching word embeddings with contextual information.

IMP : @, @

- Attention blocks refine word meanings based on context

* - Attention refines embeddings based on surrounding words; for instance, "tower" becomes more specific when preceded by "Eiffel".

Slightly disappointed you chose not to describe this update as moving the vector to be more "French-wards"

It's actually wrought iron, not steel.

Thank you for the information given at ; it cleared up my doubt from the previous video.

* - A simplified example with the phrase "a fluffy blue creature roamed the verdant forest" demonstrates how adjectives update nouns through attention.

- The attention pattern

* - Each word's initial embedding encodes its meaning and position.

Thank you very much for this very informative series on LLMs. I have a small question regarding the matrix dimensions though. @, we have that N_E = 12,288 is the embedding dimension. @6.40, we have that N_Q = 128 is the query embedding dimension; and so is N_K = 128, the key embedding dimension. So, @, the matrix W_Q must have size N_Q x N_E, and so would the matrix W_K. So, each Q_i = W_Q x E_i and K_j = W_K x E_j would have dimension N_Q x 1. If the context size is N_C (= 2048 in GPT-3, as you indicate), then the matrices Q = [Q_1 ...] and K = [K_1 ...] would each have size N_Q x N_C. Whatever the size of N_C, the size of Q x K^t would then be N_Q x N_Q, i.e., 128 x 128. But @, you say that the size of the Q x K^t matrix is N_C x N_C. Can you please explain this discrepancy? This also leads to another problem: we need to multiply Q x K^t by V. So, what would the size of V be? Thank you very much.

- Transforming embeddings through matrix-vector products and tunable weights in deep learning.

*⚙️ Matrix operations and weighted sum in Attention*
- Explanation of matrix-vector products and tunable weights in matrix operations.
- Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.
- Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.

* - Nouns generate "query" vectors to seek relevant adjectives.

I don't get where the Q and K values come from... are they from the embeddings? It is said in the video that Q is like a question about the adjectives, but where does it come from mathematically? Is it made up? I failed to understand. At , the question the noun is asking is "are there any adjectives sitting in front of me?", while there are none in front of it; they are BEHIND it, not in front, so which is it? We are reading from left to right in 2024 still, right? It's in the small details that this falls apart for me. Then it is said that the question is "SOMEHOW" encoded as another vector... yeah, so it just magically popped into existence?

Can someone explain to me how those questions (queries) are generated, and how keys respond to them? I'm a bit confused there. Are the questions predefined? And how are the keys created?
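
A minimal sketch, with toy dimensions, of where queries and keys come from: each is just a learned linear map (a matrix of tunable weights, random at initialization and adjusted during training) applied to a token's embedding. The "question" is not predefined; the entries of W_Q are learned, so the model itself discovers what is useful to ask.

```python
import numpy as np

rng = np.random.default_rng(1)
d_embed, d_qk = 8, 4  # toy sizes; GPT-3 uses 12,288 and 128

# Tunable weight matrices: random at initialization, tuned by gradient descent.
W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

E_creature = rng.normal(size=(d_embed,))  # one token's context-free embedding

Q = W_Q @ E_creature  # that token's query: "what am I looking for?"
K = W_K @ E_creature  # that token's key: "what do I have to offer?"
print(Q.shape, K.shape)  # (4,) (4,)
```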

I love that you make the embeddings column vectors! Machine learning people love designating *row* vectors as embeddings/queries/keys/etc. (including the Attention paper), and this makes all the equations flipped from how we expect in math: Q = EW instead of Q = WE, etc.

Can the query ask forwards also? Like, let's say we have "I saw a creature, it was huge and foul, it was eating grass"; that should on some level produce a similar result to "The huge and foul creature I saw was eating grass", and the only way they'd seem similar is if "creature" can query both forwards and backwards.

Are the positions of the left and right tensors in the multiplications somehow swapped? The same seems true in many other places, like the rows and columns of the mask matrix.

Interesting, so an LLM could also use the following words to fill out information in the middle of a text?

- Also at , the earlier dimensional size of 128 for the Q, K spaces is only for the case of multiple heads (implicitly 96 heads in this example), whereas later you correctly switch back to 12,288 dimensions.

So when you hear that developers of AI don't know what AI is doing internally, is that referring to how the attention layers are placing the vectors in a tensor? Is there more to it? The media makes it sound mysterious and potentially dangerous, but really it's just the method used to assign a high-dimensional coordinate to a token within the context of the English language.

- Transformers use a key matrix to match queries and measure relevance.

* - "Key" vectors are created for each word and compared with queries using dot products to assess relevance.

* - The resulting grid of dot products, after softmax normalization, represents the attention pattern, indicating how each word relates to others.
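
A minimal sketch of that grid of dot products, using the video's convention (embeddings as columns, one query and key per token, toy dimensions). Dividing by the square root of the key/query dimension before the softmax is the standard scaling and is assumed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_embed, d_qk = 5, 8, 4  # toy sizes

E = rng.normal(size=(d_embed, n_tokens))  # embeddings as columns (video convention)
W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

Q = W_Q @ E  # one query per token, shape (d_qk, n_tokens)
K = W_K @ E  # one key per token,   shape (d_qk, n_tokens)

# Entry (i, j) of the grid: how well token i's key matches token j's query.
scores = (K.T @ Q) / np.sqrt(d_qk)

# Softmax each column, so the weights feeding into one token's update sum to 1.
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
attention_pattern = weights / weights.sum(axis=0, keepdims=True)
print(attention_pattern.sum(axis=0))  # all ones
```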

Great video as always! Minor quibble at , I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective) so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric, I just haven’t heard it used colloquially that direction (maybe due to the axis that the softmax is applied on?)

Amazing work; thank you for doing it. Now, am I misunderstanding something, or is there possibly a mistake at in the "roamed" column? The weight for the word "the" is 0.99 even though it appears _after_ "roamed" in the context. This frightens me, as math can't ever be wrong.

Possible error at (): *Q_i* and *K_j* should be _row vectors_ so that *{QKᵀ}_{i,j} = Q_i ⋅ K_j* is their dot product.

Very good video, just a small question: if you're treating vectors as column vectors from a math perspective, shouldn't it be V softmax(KᵗQ)? The original paper puts V on the right side and uses softmax(QKᵗ)V because, I think, it assumes row vectors by default, which makes more sense from a computing perspective due to memory efficiency.

- Attention mechanism ensures no later words influence earlier words

- Masking

* - During training, the model predicts the next token for various subsequences, requiring masking to prevent future tokens from influencing past predictions.

* - Masking sets irrelevant attention pattern entries to negative infinity before softmax, resulting in zeros after normalization.

(usually the entries in the upper-right triangle are set to -inf)
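
A minimal sketch of that masking step, continuing the column-wise toy setup above. With columns as queries and rows as keys, the masked region is the lower-left triangle; in the more common row-as-query QKᵀ layout it is the upper-right triangle, as the note above says.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens = 5
scores = rng.normal(size=(n_tokens, n_tokens))  # stand-in for K.T @ Q / sqrt(d_qk)

# Rows index keys, columns index queries (video convention): a key from a LATER
# token must not influence an EARLIER token, so entries with key_pos > query_pos
# are set to negative infinity before the softmax.
key_pos = np.arange(n_tokens)[:, None]
query_pos = np.arange(n_tokens)[None, :]
masked = np.where(key_pos > query_pos, -np.inf, scores)

# After the softmax those entries become exactly zero.
weights = np.exp(masked - masked.max(axis=0, keepdims=True))
attention_pattern = weights / weights.sum(axis=0, keepdims=True)
print(np.round(attention_pattern, 2))  # zero below the main diagonal
```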

* - Attention pattern size scales with the square of the context size, making larger contexts computationally expensive.

- Context size

The motivation for masking is not entirely clear; I will need to rewatch to understand it better.

- Attention mechanism variations aim at making context more scalable.

* - A "value" matrix determines how embeddings should be updated based on relevance.

- Values

Can you explain to me: when you added the matrix W, what are the values in it? The video only says that you need to multiply by these values, but what are the values initially?

How does the attention mechanism avoid getting caught in a sort of loop? For example, in the expression "fluffy creature", "fluffy" clearly modifies "creature", i.e. "creature" as in "fluffy creature" as opposed to "spiky creature". However, the specific noun in question also modifies the meaning of the adjective. For example, "fluffy" as in "fluffy creature" is not the same as "fluffy" as in "fluffy argument". In a sense, humans evaluate these things quite atomically. Is there a sort of back-and-forth iteration that exits after a certain point? If so, on what criteria?

* - Value vectors are added to embeddings based on the attention pattern weights, refining the meaning of words based on context.
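
A minimal sketch of that update, assuming toy sizes and a full-rank value map W_V (the down/up factorization comes later in these notes): each column of ΔE is the attention-weighted sum of the value vectors, and it is added onto the original embedding.

```python
import numpy as np

rng = np.random.default_rng(4)
n_tokens, d_embed = 5, 8
E = rng.normal(size=(d_embed, n_tokens))   # current embeddings, one per column
W_V = rng.normal(size=(d_embed, d_embed))  # value map (full-rank toy version)

V = W_V @ E  # one value vector per token

# Stand-in attention pattern: column j says how relevant each token is to token j.
A = rng.random(size=(n_tokens, n_tokens))
A = A / A.sum(axis=0, keepdims=True)

# Each token's change is the attention-weighted sum of the value vectors...
delta_E = V @ A
# ...and the refined embedding is the original plus that change.
E_refined = E + delta_E
print(E_refined.shape)  # (8, 5)
```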

*if* this word is relevant to adjusting the meaning of something else...

@ Shouldn't the main diagonal in the attention pattern matrix (query-key dot product) also be zero, i.e. a word cannot give additional context to update its own embedding?

For the video content from to : E4 is already the output after undergoing the self-attention mechanism. From the matrix, it can also be seen that the attention weights at most diagonal positions are 1 or close to 1. So why do we still need E4 + ΔE4? I personally believe that ΔE

At , when describing the updating of a given embedding vector with the preceding embeddings selected for by the attention mechanism, I'm not understanding the need for transforming them to value vectors. What does this E_i W_v = V_i transformation provide that simply taking the attention-weighted sum of the E_i's and updating your embedding directly doesn't?

- Transformers use weighted sums to produce refined embeddings from attention

- I think there is an error at , where E5 is shown attending to E6 (value 0.99 shown) which is a forward (future) dependency and should be masked (i.e., set to zero).

Is this just a matrix multiplication? How do you go from the value matrix V and the attention scores K^T Q to ΔE?

For the content, it seems that ΔE5 should not receive information about V6, as ΔE5 can only receive information from V1 to V5 at most. Why is ΔE5 = 0.9 * V6 in the video? Thank you very much!

Just something I didn't fully understand: in , it says the deltas (computed by the attention) are added to the context-free word embeddings to create an in-context embedding. Where is this addition taking place? I did not manage to see where it is located in the "Attention Is All You Need" paper.

At , I think it is possible to compact the operation into matrix multiplication, then add the columns to the original word vectors.

* - A single attention head involves key, query, and value matrices, with GPT-3 using a 128-dimensional key/query space and a 12,288-dimensional embedding space.

- Counting parameters

* - Value matrices are factored into "value down" and "value up" matrices to improve efficiency, resulting in approximately 6.3 million parameters per head.
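
A rough sanity check of the ~6.3 million figure, using the dimensions quoted above (approximate bookkeeping, not an exact accounting of GPT-3's weights):

```python
d_embed = 12_288  # GPT-3 embedding dimension
d_qk = 128        # key/query dimension
d_value = 128     # "value down" dimension (chosen to match the key/query space)

query_params = d_qk * d_embed          # W_Q
key_params = d_qk * d_embed            # W_K
value_down_params = d_value * d_embed  # maps 12,288 dims down to 128
value_up_params = d_embed * d_value    # maps 128 dims back up to 12,288

per_head = query_params + key_params + value_down_params + value_up_params
print(f"{per_head:,}")  # 6,291,456, i.e. about 6.3 million parameters per head
```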

Is it only due to efficiency, as you said in , or is there also an intuitive reason that the rank (degrees of freedom) of the value map should not be more than the rank of the query and key matrices?

But it seems there is a minor bug in the video at , where the "value down" matrix is explained: shouldn't the intermediate result vector at this point be only 128 elements, and not 12,288 as shown? The narration does explain that we are mapping to a lower-dimensional space. (Correspondingly, the input to the "value up" matrix would be this 128-size vector, generating a 12,288-size result.)

Love the 3b1b humblebrag here: essentially, "Those paper writers make things confusing, and I am here to lead you with knowledge." Thank you Grant for bringing this to all of us!

- Self-attention mechanism explained with parameter count and cross-attention differentiation.

- Doesn't that also mean we're reducing the information in the embedding vectors to the smaller number of dimensions in the key/query space?

- Cross-attention

* - Cross-attention is a variation used in models processing different data types (e.g., translation), where keys and queries come from separate datasets.

At , not necessarily: cross-attention can work between two sequences of the same modality, like T5. It's just that one sequence is seen as the input, or the information the model should attend to, and the second sequence is the output.
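
A minimal sketch of the difference, reusing the toy machinery above: in cross-attention the queries come from one sequence while the keys (and values) come from another, and typically no causal mask is applied between them.

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, d_qk = 8, 4
n_target, n_source = 3, 6  # e.g. a partial translation vs. the source sentence

E_target = rng.normal(size=(d_embed, n_target))  # sequence being generated
E_source = rng.normal(size=(d_embed, n_source))  # sequence being attended to

W_Q = rng.normal(size=(d_qk, d_embed))
W_K = rng.normal(size=(d_qk, d_embed))

Q = W_Q @ E_target  # queries come from one sequence...
K = W_K @ E_source  # ...keys come from the other

scores = (K.T @ Q) / np.sqrt(d_qk)  # shape (n_source, n_target)
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
pattern = weights / weights.sum(axis=0, keepdims=True)
print(pattern.shape)  # each target token attends over all source tokens
```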

- Multiple heads

* - Multi-headed attention runs multiple attention heads in parallel to capture various contextual relationships.

Thank you for this explanation! Not to quibble, but "brakes" spelled incorrectly at .

- Transformers use multi-headed attention to capture different attention patterns

In the example at , "John hits the breaks sharply", the word "break" means to separate into pieces, whereas "brake" refers to a device used for slowing motion. Clearly the word "brake" is appropriate. This in itself presents an interesting problem for the model to address: the context of the inappropriate use of the word "break" must cause the model to effectively "correct" for this error. Can anyone expand on this concept, since the use of language by humans is inherently imperfect? A very interesting and informative series of videos.

* - GPT-3 uses 96 heads, each with distinct key, query, and value maps, enabling the model to learn diverse ways context affects meaning.

I have a question at : I have read that if the original embedding has C dimensions, then in a multi-head attention block the output of each head has A = C / (number of heads) dimensions. For example, if C is 48 and we have 3 heads in the attention block, each head's output would be 16-dimensional. But then we cannot possibly add a 48-dimensional vector to a 16-dimensional one.

At , why don't we normalize the variations produced and added by the multiple attention heads by dividing the whole sum by the number of heads (96 here)? In the current situation, I have the feeling that we are adding 96 times more variation than we need to the previous embedding.

At , is there any paper reflecting on how many of these attention heads are redundant? E.g., logging during training the percentage of attention heads that actually contribute to the change of an embedding, and possibly dropping some of them.

*🧠 Multi-Headed Attention Mechanism in Transformers*
- Explanation of how each attention head has distinct value matrices for producing value vectors.
- Introduction to the process of summing proposed changes from different heads to refine embeddings in each position.
- Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.

You represented one output of the attention layer as E' = ΔE + E. I am wondering where the ΔE comes from. The matrix multiplication already represents a weighted sum: V' = attn(Q, K, V) = softmax(·)V; that is, each output vector in V' is the weighted sum of all the vectors in V.

* - The proposed changes from each head are summed and added to the original embedding, resulting in a refined embedding.
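
A minimal sketch of that summation with toy sizes and only three heads (masking omitted for brevity): each head has its own key, query, and value-down/value-up maps, and its proposed changes are simply accumulated.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tokens, d_embed, d_qk, n_heads = 5, 8, 4, 3  # toy sizes (GPT-3: 12,288 / 128 / 96)
E = rng.normal(size=(d_embed, n_tokens))

delta_total = np.zeros_like(E)
for _ in range(n_heads):
    # Each head has its own learned key, query, and value maps.
    W_Q = rng.normal(size=(d_qk, d_embed))
    W_K = rng.normal(size=(d_qk, d_embed))
    W_V_down = rng.normal(size=(d_qk, d_embed))
    W_V_up = rng.normal(size=(d_embed, d_qk))

    scores = ((W_K @ E).T @ (W_Q @ E)) / np.sqrt(d_qk)
    weights = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = weights / weights.sum(axis=0, keepdims=True)  # (masking omitted for brevity)

    V = W_V_up @ (W_V_down @ E)  # this head's value vectors
    delta_total += V @ A         # accumulate this head's proposed changes

E_refined = E + delta_total      # all proposals added onto the original embeddings
print(E_refined.shape)           # (8, 5)
```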

In , if I am not mistaken, the results from the different heads are concatenated into a higher-dimensional matrix and projected back to the original dimension, instead of simply being added up together.

I have a question. At , why don't you take the average of all those proposed changes? If you had a lot of attention heads, wouldn't they all together overestimate the change that should be made to the original embedding of a token? Or is this problem automatically fixed by the backpropagation algorithm, so that each change calculated by an attention head is smaller than it would have been if there were only one attention head in the attention block?

- The output matrix

* - In practice, "value up" matrices for all heads are combined into a single "output matrix" for efficiency.

*🛠️ Technical Details in Implementing Value Matrices*
- Description of the implementation difference: the per-head value-up matrices are combined into a single output matrix.
- Clarification of technical nuances in how value matrices are structured in practice.
- Note on the distinction between value-down and value-up matrices as commonly seen in papers and implementations.

- Implementation of attention differs in practice

Great video, as usual! I'm stuck at the explanation at . The visualization shows that the projection-up matrices are concatenated into the output matrix, but the explanation says that the concatenation is then multiplied by the output matrix (itself?). If this is a typo and he means "multiplied by the projection-down matrices", how does this work? I remember matrix multiplication only working if the dimensions match, like (n x m) * (m x k), where m has to be the same. Thanks!

- Going deeper

* - Data flows through multiple attention blocks and other operations, allowing for increasingly nuanced and abstract encoding of information.

Overall a very good explanation, just one question: I saw the animation many times in chapters 5 and 6, and it shows later words updating earlier words. But since you explicitly mentioned the masking in the video and in the pinned comment, I am confused. I am leaning towards it being a typo, same as the 0.99 entry between E5 and E6, which should be masked to 0.00.

Thanks for the video! Does anybody know why the glowing attention lines were drawn going both ways (e.g. ), when we chop off the lower part of the attention matrix? Shouldn't this mean that the lines should only go forward (to the right)?

*💡 Embedding Nuances and Capacity for Higher-Level Encoding*
- Discussion of how embeddings become more nuanced as data flows through multiple transformer layers.
- Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors.
- Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.

One question concerning : does every new vector added to the initial meaning of "one" represent the newly learned, more refined meaning for each attention head, or for each attention layer? I think it is each layer, but on the other hand, every attention head seems to learn a different way in which context changes meaning, so it could be both.

* - GPT-3's 96 layers contain about 58 billion parameters devoted to attention heads, representing a significant portion of the total model parameters.
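
A rough back-of-the-envelope check of that figure, reusing the per-head count from above (approximate bookkeeping only):

```python
per_head = 4 * 128 * 12_288   # key + query + value-down + value-up matrices
per_layer = per_head * 96     # 96 heads per attention block
total = per_layer * 96        # 96 layers
print(f"{total:,}")           # 57,982,058,496, i.e. roughly 58 billion
```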

- Attention mechanism's success lies in parallelizability for fast computations.

- Ending

* - The success of attention is partly due to its parallelizability, enabling efficient computation with GPUs.

Love this. If you ever edit this again, at , “brakes” is misspelled as “breaks”.
