
- Where facts in LLMs live


Wait, we don't actually know how it works fully?

Hold on, they build the LLM and don't know how the facts are cataloged? This is gonna be a doozy.

- Quick refresher on transformers

My understanding is that an LLM does not in fact store facts; rather, through predicting word associations by being trained on an astoundingly large set of examples, it "stores" the likelihood that "basketball" is historically the most probable next word in the sequence. It doesn't have any grasp of a concept of basketball in any meaningful or even static way. This is exactly the problem I'm trying to solve, and honestly I think I found a solution. I just don't know yet how reliable it is at a large scale, or how economical it is in terms of the required computing power. We'll see.
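A tiny sketch of that "stores the likelihood" idea, with made-up numbers (nothing from an actual model): the trained weights just shape a probability distribution over the next token, and "basketball" ends up with most of the mass after the right context.

```python
import numpy as np

# Hypothetical logits for the token after "Michael Jordan plays the sport of ..."
# (made-up numbers): the model doesn't store the fact as a record, it stores
# weights that make "basketball" the highest-probability continuation.
tokens = ["basketball", "baseball", "chess", "golf"]
logits = np.array([2.9, 0.4, -1.2, 0.1])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over this slice of the vocabulary

for tok, p in zip(tokens, probs):
    print(f"{tok:<11} {p:.3f}")           # "basketball" dominates
```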


What could be simplified is half a circumference with two half circumferences inside it, into infinity... that would show some precision as to how the AI is doing this efficiently while increasing the accuracy of predictions: the selection line on the graph, the sequence-of-vectors lines, the attention lines.

The tokens (words) convey context information to each other, making the embedding a richer, more nuanced version than the simple meaning of the word. When this animation is shown, the arrows move from later tokens to earlier tokens as well. Isn't this contradictory to the concept introduced during masking, where it is said that only earlier words are allowed to enrich later words? (This is a common animation shown multiple times in this series.)

Only the joker would pick stranger over stronger

"Live in a high-dimensional space"? Please expand.

I was unable to reproduce woman - man ≈ aunt - uncle using either the OpenAI embedding model 'text-embedding-3-small' or the older 'text-embedding-ada-002' model via LangChain. Cosine similarity of 0.29. I tried lots of pairings: aunt-uncle, woman-man, sister-brother, and queen-king. All had cosine similarities in the range 0.29 to 0.38. Happy to share my work if you're curious.
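For anyone who wants to try reproducing this, here is a minimal sketch of the experiment, assuming the official openai Python package (v1 client) rather than LangChain; the model name and word pairs are the ones mentioned above, and an OPENAI_API_KEY must be set.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(word: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=word)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the *difference* vectors, e.g. woman - man vs. aunt - uncle.
diff_1 = embed("woman") - embed("man")
diff_2 = embed("aunt") - embed("uncle")
print(cosine(diff_1, diff_2))
```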

This subtraction of vectors makes me wonder if all of category theory can be described using linear algebra.

If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (A vector "meaning" man, woman, uncle, or aunt, as in the earlier example.)

Kinda like how neural synapses work... when neurons fire together, they wire together, and then it tickles one of the adjacent "dormant" neurons and it lights up with a memory like, "Oh yeah! Totally forgot about that until you just mentioned it again to me..." right?
I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?

- Assumptions for our toy example

This is unironically how I understand "the spectrum" of autism, for example.

Are you implying that the vectors are not normalized, and therefore a dot product of 1 does not mean they are parallel? So what we call semantic similarity is not a measure of pointing in the same direction? So the dot product could be 1 with several different directions at the same time?

I don't understand the assumptions made about dot products. Why is a dot product of 1 used to mean that the vector encodes that particular direction/concept? I would have thought that the vector needs to be parallel to that concept vector to assume it encodes that concept. But then a vector would only be able to encode one concept. Is that why dot product = 1 is just sort of conventionally chosen?

How can the dot product of a vector with both "Michael" and "Jordan" be 1 when earlier it was said that "Michael" and "Jordan" are nearly orthogonal to each other?
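One way to see how that can work, with toy numbers of my own (not the video's actual vectors): if "Michael" and "Jordan" point in nearly orthogonal directions, their sum has a dot product of about 1 with each of them, so a single embedding can answer "yes" to both questions at once.

```python
import numpy as np

rng = np.random.default_rng(0)

michael = rng.normal(size=256)
michael /= np.linalg.norm(michael)          # unit vector for "Michael"

jordan = rng.normal(size=256)
jordan -= (jordan @ michael) * michael      # make it orthogonal to "Michael"
jordan /= np.linalg.norm(jordan)            # unit vector for "Jordan"

v = michael + jordan                        # an embedding encoding both directions
print(round(v @ michael, 3))                # ~1.0
print(round(v @ jordan, 3))                 # ~1.0
print(round(michael @ jordan, 3))           # ~0.0
```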

- Inside a multilayer perceptron

Does that sequence of high-dimensional vectors (let's call it a 1D array) in the MLP behave as its own tensor in the LLM?

You're telling me that the AI is Akinator?

Who determines "bias" or is it a "vector" with a "code" as well?

Otherwise all the borders would go through the origin (0, 0).

, "...continuing with the deep learning tradition of overly fancy names..." 😂🤣😂

So this is just an "if, then" function?

The bias exists to move the border between yes and no. It is literally the b in y = mx + b. Without it, all the lines y = mx go through (0, 0).

So the weights are simultaneously nudged so that the columns form vector encodings for output words, while the rows form patterns that determine, via multiplication with the input, how much each column should be used?
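Here is a toy illustration of that reading, with entirely made-up numbers (not real model weights): each row of the up-projection asks "how much does the input match this pattern?", and each column of the down-projection is a direction added to the output, scaled by the (ReLU'd) answer to its row's question.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 0.0])          # toy input embedding

W_up = np.array([[1.0, 0.0, 1.0, 0.0],      # row 0: detects one pattern
                 [0.0, 1.0, 0.0, 1.0]])     # row 1: detects a different one
W_down = np.array([[0.0, 2.0],              # column 0 is the direction added
                   [1.0, 0.0],              # when row 0 fires, column 1 when
                   [0.0, 0.0],              # row 1 fires
                   [3.0, 1.0]])

h = np.maximum(0.0, W_up @ x)               # [2., 0.]: only pattern 0 matches
y = W_down @ h                              # 2 * (column 0 of W_down)
print(h, y)                                 # [2. 0.] [0. 2. 0. 6.]
```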

As for the question of what the bias does: it's just a control for the height at which you put the threshold of the ReLU. This way you can clip the data at different values depending on the context.
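A minimal sketch of that point, with toy numbers: the bias just shifts where ReLU(w·x + b) switches from 0 to positive, i.e. how high the firing threshold sits.

```python
import numpy as np

def neuron(x, w, b):
    return np.maximum(0.0, w @ x + b)       # ReLU(w.x + b)

w = np.array([1.0, 1.0])
x = np.array([0.3, 0.2])                    # w.x = 0.5

print(neuron(x, w, b=0.0))                  # 0.5 -> fires
print(neuron(x, w, b=-1.0))                 # 0.0 -> the bias raised the threshold to 1.0
```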

I think this might be a misinterpretation. The MLP block uses the same 50,000 neurons for all tokens in the sequence, not 50,000 neurons per token. @3blue1brown, is that correct?

I'm wondering if the phrasing here is a bit misleading. Unless I'm missing something, the block has 50,000 neurons, but the sequence of tokens is passed through it, meaning you get the number of neurons multiplied by the number of tokens as activations, not as separate neurons. This part might lead someone to think that those neurons are different for each token, but they are not; only the activations differ. Regardless, this is an excellent video.
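A shape-level sketch of that point, with toy sizes rather than GPT-3's real 12,288 / 49,152: the same MLP weights are applied to every token in the sequence, so a longer sequence means more activations, not more neurons.

```python
import numpy as np

d_model, d_hidden, n_tokens = 8, 32, 5
rng = np.random.default_rng(0)

W_up, b_up = rng.normal(size=(d_hidden, d_model)), np.zeros(d_hidden)
W_down, b_down = rng.normal(size=(d_model, d_hidden)), np.zeros(d_model)

X = rng.normal(size=(n_tokens, d_model))    # one row per token in the sequence
H = np.maximum(0.0, X @ W_up.T + b_up)      # (n_tokens, d_hidden) activations
Y = H @ W_down.T + b_down                   # the same shared weights for every token
print(H.shape, Y.shape)                     # (5, 32) (5, 8)
```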

Are "bias" parts of speech "adjectives" and "adverbs?"

the one piece is real

ITS REAL

- Counting parameters

The parameters for the FF network are counted here. Are these the parameters for the FF network of a single token? If so, does this mean that the total number of parameters, including shared parameters, is much higher?

Great video! In case anybody is wondering how to count the parameters of the Llama models, use the same math as in the video, but keep in mind that Llama has a third projection in its MLP, the 'gate projection', of the same size as the up or down projections.
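A quick sketch of that counting. The GPT-3-style numbers follow the video (d_model = 12,288, hidden size 4x that); the Llama-2-7B sizes (4,096 and 11,008, no biases) are quoted from memory, so treat them as an assumption to double-check.

```python
def mlp_params(d_model: int, d_hidden: int, gated: bool = False, bias: bool = True) -> int:
    up = d_model * d_hidden + (d_hidden if bias else 0)
    down = d_hidden * d_model + (d_model if bias else 0)
    gate = d_model * d_hidden if gated else 0     # Llama's extra gate projection
    return up + down + gate

print(mlp_params(12_288, 4 * 12_288))                     # ~1.2 billion per GPT-3-style block
print(mlp_params(4_096, 11_008, gated=True, bias=False))  # one Llama-2-7B-style MLP block
```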

- Superposition

the superposition chapter is great... Watch it guys n girls

Re: superposition, and the dimensions not being completely independent but rather related. Here's a way to understand superposition that IMO was not really clear in the video.

Where could someone find the source material or "footnotes/bibliography" for each LLM's main base of facts, i.e. the standardized information deemed "valid" by independent, accredited international sources or bodies of information?

Is there a way to estimate the number of additional "dimensions" you get by allowing 89-91 degrees versus exactly 90 degrees?

this part is really cool!

The construction in which all the vectors are nearly orthogonal is illuminating. It suggests that the surface of the n-dimensional sphere is being partitioned into a vast quantity of locally flat "Gaussians" (central limit ;-) of similarity directions. Once you have that, plus the layer depth to discriminate conceptual level, one gets to see how it works, though it doesn't have any explanatory capability, because its vocabulary (numeric vectors) does not bake in the human explanatory phrasings we use (all very "physician, heal thyself" given it's an LLM!).

Can't I make it in JavaScript? ^^

Important correction: There's an error in the scrappy code I was demoing around this point, such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, it appears not to be possible to get 100k vectors in 100d to be as "nearly orthogonal" as this. 100 dimensions seems to be too low, at least for the (89°, 91°) range, for the Johnson-Lindenstrauss lemma to really kick in.
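Not a replacement for the demo code, just a quick empirical check of the underlying point: random unit vectors in a high-dimensional space already sit close to 90° from each other, and the spread of angles tightens as the dimension grows (the sizes here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (100, 1_000, 10_000):
    V = rng.normal(size=(200, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # 200 random unit vectors
    dots = V @ V.T
    off_diag = dots[~np.eye(len(V), dtype=bool)]    # pairwise dot products
    angles = np.degrees(np.arccos(off_diag))
    print(dim, round(angles.min(), 1), round(angles.max(), 1))  # range narrows toward 90°
```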

It's a bit unclear to me how relaxing the vectors' perpendicularity (adding a bit of "noise" to the 90° angles) can create space for additional features. Can somebody help me understand that?

Another way to imagine it is shooting an arrow in space, and shooting a second arrow in a direction 0.001° different. The first inch is nothing, nor is the first 20. But as they go feet and miles out, they'll eventually be so far apart that it's hard to believe they came from the same bow. Also chaotic pendulums, such as a pendulum on the end of a pendulum: slight changes end up producing completely different movement.

More and more it feels as if current networks are mainly our first bookshelves.

"...times as many independent ideas." 💥

Bell Laboratories! I am currently interning at Bell Laboratories :) A fun fact: Yann LeCun created CNNs as part of his internship at Bell Labs.

This reminds me of Bloom filters.

Hey y'all! 🇺🇿

It seems obvious to me that a superposition would store more data, not because of nearly perpendicular vectors, but because you're effectively moving from a unary system to a higher base. It's the same reason you can count to 10, or to 1023, on the same number of fingers.

Skip to that point to understand the issue. Training GPT-4 in 2022 should have taken around a cool thousand years. Then Huang says something silly: "Well, they used a stack of 8,000 H100 GPUs and it only took three months," forgetting that the H100 was only on the drawing board back in 2022 when GPT-4 was trained. Now read a little about the latest discoveries in brain science, especially the N400 and P600, and you tell me how to explain Dan, Rob, Max, and Dennis. I'm gonna leave this up to you, as I'm sure you understand what I'm getting at.

- Up next

Also important in the training process is the concept of self-supervised learning, used to harness the mass of unlabelled data (books, in NLP).

Holograms coming! :D

Uhhh, holograms, I'm so excited, and that is on top of an excellent video. I'm amazed how you manage to consistently keep such a high standard :D
