
- Where facts in LLMs live


Wait, we don't actually know how it works fully?

Hold on, they build the LLM and don't know how the facts are cataloged? This is gonna be a doozy.

- Quick refresher on transformers

My understanding is that an LLM does not in fact store facts; rather, through predicting word associations by being trained on an astoundingly large set of examples, it "stores" the likelihood that "basketball" is historically the most probable next word in the sequence. It doesn't have any grasp of a concept of basketball in any meaningful or even static way. This is exactly the problem I'm trying to solve, and honestly I think I found a solution. I just don't know yet how reliable it is at a large scale, or how economical it is in terms of the required computing power. We'll see.
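A tiny sketch of that "stores the likelihood" idea, with made-up numbers (nothing from an actual model): the trained weights just shape a probability distribution over the next token, and "basketball" ends up with most of the mass after the right context.

```python
import numpy as np

# Hypothetical logits for the token after "Michael Jordan plays the sport of ..."
# (made-up numbers): the model doesn't store the fact as a record, it stores
# weights that make "basketball" the highest-probability continuation.
tokens = ["basketball", "baseball", "chess", "golf"]
logits = np.array([2.9, 0.4, -1.2, 0.1])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over this slice of the vocabulary

for tok, p in zip(tokens, probs):
    print(f"{tok:<11} {p:.3f}")           # "basketball" dominates
```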


What could be simplified is half a circumference with two half circumferences inside it, into infinity... that would show some precision as to how the AI is doing this efficiently while increasing the accuracy of predictions: the selection line on the graph, the sequence-of-vectors lines, the attention lines.

The tokens (words) convey context information to each other, making the embedding a richer, more nuanced version than the simple meaning of the word. When this animation is shown, the arrows move from later tokens to earlier tokens as well. Isn't this contradictory to the concept introduced during masking, where it is said that only earlier words are allowed to enrich later words? (This is a common animation shown multiple times in this series.)

Only the joker would pick stranger over stronger

"Live in a high-dimensional space"? Please expand.

I was unable to reproduce woman - man ≈ aunt - uncle using either the OpenAI embedding model 'text-embedding-3-small' or the older 'text-embedding-ada-002' model via LangChain. Cosine similarity of 0.29. I tried lots of pairings: aunt-uncle, woman-man, sister-brother, and queen-king. All had cosine similarities in the range 0.29 to 0.38. Happy to share my work if you're curious.
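For anyone who wants to try reproducing this, here is a minimal sketch of the experiment, assuming the official openai Python package (v1 client) rather than LangChain; the model name and word pairs are the ones mentioned above, and an OPENAI_API_KEY must be set.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(word: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=word)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the *difference* vectors, e.g. woman - man vs. aunt - uncle.
diff_1 = embed("woman") - embed("man")
diff_2 = embed("aunt") - embed("uncle")
print(cosine(diff_1, diff_2))
```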

This subtraction of vectors makes me wonder if all of category theory can be described using linear algebra.

If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (A vector "meaning" man, woman, uncle, or aunt, as in the earlier example.)

Kinda like how neural synapses work... when neurons fire together, they wire together, and then it tickles one of the adjacent "dormant" neurons and it lights up with a memory like, "Oh yeah! Totally forgot about that until you just mentioned it again to me..." right?
I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?

- Assumptions for our toy example

This is unironically how I understand "the spectrum" of autism, for example.

Are you implying that the vectors are not normalized, and therefore a dot product of 1 does not mean they are parallel? So what we call semantic similarity is not a measure of pointing in the same direction? So the dot product could be 1 with several different directions at the same time?

I don't understand the assumptions made about dot products. Why is a dot product of 1 used to mean that the vector encodes that particular direction/concept? I would have thought that the vector needs to be parallel to that concept vector to assume it encodes that concept. But then a vector would only be able to encode one concept. Is that why dot product = 1 is just sort of conventionally chosen?

How can the dot product of a vector with both "Michael" and "Jordan" be 1 when earlier it was said that "Michael" and "Jordan" are nearly orthogonal to each other?
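One way to see how that can work, with toy numbers of my own (not the video's actual vectors): if "Michael" and "Jordan" point in nearly orthogonal directions, their sum has a dot product of about 1 with each of them, so a single embedding can answer "yes" to both questions at once.

```python
import numpy as np

rng = np.random.default_rng(0)

michael = rng.normal(size=256)
michael /= np.linalg.norm(michael)          # unit vector for "Michael"

jordan = rng.normal(size=256)
jordan -= (jordan @ michael) * michael      # make it orthogonal to "Michael"
jordan /= np.linalg.norm(jordan)            # unit vector for "Jordan"

v = michael + jordan                        # an embedding encoding both directions
print(round(v @ michael, 3))                # ~1.0
print(round(v @ jordan, 3))                 # ~1.0
print(round(michael @ jordan, 3))           # ~0.0
```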

- Inside a multilayer perceptron

Does that sequence of high-dimensional vectors (let's call it a 1D array) in the MLP behave as its own tensor in the LLM?

You're telling me that the AI is Akinator?

Who determines "bias" or is it a "vector" with a "code" as well?

Otherwise all the borders would go through the origin (0, 0).

, "...continuing with the deep learning tradition of overly fancy names..." 😂🤣😂

So this is just an "if, then" function?

The bias exists to move the border between yes and no. It is literally the b in y = mx + b. Without it, all the lines y = mx go through (0, 0).

So the weights are simultaneously nudged so that the columns form vector encodings for output words, while the rows form patterns that determine, via multiplication with the input, how much each column should be used?
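Here is a toy illustration of that reading, with entirely made-up numbers (not real model weights): each row of the up-projection asks "how much does the input match this pattern?", and each column of the down-projection is a direction added to the output, scaled by the (ReLU'd) answer to its row's question.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 0.0])          # toy input embedding

W_up = np.array([[1.0, 0.0, 1.0, 0.0],      # row 0: detects one pattern
                 [0.0, 1.0, 0.0, 1.0]])     # row 1: detects a different one
W_down = np.array([[0.0, 2.0],              # column 0 is the direction added
                   [1.0, 0.0],              # when row 0 fires, column 1 when
                   [0.0, 0.0],              # row 1 fires
                   [3.0, 1.0]])

h = np.maximum(0.0, W_up @ x)               # [2., 0.]: only pattern 0 matches
y = W_down @ h                              # 2 * (column 0 of W_down)
print(h, y)                                 # [2. 0.] [0. 2. 0. 6.]
```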

As for the question of what the bias does: it's just a control for the height at which you put the threshold of the ReLU. This way you can clip the data at different values depending on the context.
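A minimal sketch of that point, with toy numbers: the bias just shifts where ReLU(w·x + b) switches from 0 to positive, i.e. how high the firing threshold sits.

```python
import numpy as np

def neuron(x, w, b):
    return np.maximum(0.0, w @ x + b)       # ReLU(w.x + b)

w = np.array([1.0, 1.0])
x = np.array([0.3, 0.2])                    # w.x = 0.5

print(neuron(x, w, b=0.0))                  # 0.5 -> fires
print(neuron(x, w, b=-1.0))                 # 0.0 -> the bias raised the threshold to 1.0
```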

I think this might be a misinterpretation. The MLP block uses the same 50,000 neurons for all tokens in the sequence, not 50,000 neurons per token. @3blue1brown, is that correct?

I'm wondering if the phrasing here is a bit misleading. Unless I'm missing something, the block has 50,000 neurons, but the sequence of tokens is passed through it, meaning you get the number of neurons multiplied by the number of tokens as activations, not as separate neurons. This part might lead someone to think that those neurons are different for each token, but they are not; only the activations differ. Regardless, this is an excellent video.
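A shape-level sketch of that point, with toy sizes rather than GPT-3's real 12,288 / 49,152: the same MLP weights are applied to every token in the sequence, so a longer sequence means more activations, not more neurons.

```python
import numpy as np

d_model, d_hidden, n_tokens = 8, 32, 5
rng = np.random.default_rng(0)

W_up, b_up = rng.normal(size=(d_hidden, d_model)), np.zeros(d_hidden)
W_down, b_down = rng.normal(size=(d_model, d_hidden)), np.zeros(d_model)

X = rng.normal(size=(n_tokens, d_model))    # one row per token in the sequence
H = np.maximum(0.0, X @ W_up.T + b_up)      # (n_tokens, d_hidden) activations
Y = H @ W_down.T + b_down                   # the same shared weights for every token
print(H.shape, Y.shape)                     # (5, 32) (5, 8)
```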

Are "bias" parts of speech "adjectives" and "adverbs?"

the one piece is real

ITS REAL

- Counting parameters

The parameters for the FF network are counted here. Are these the parameters for the FF network of a single token? If so, does this mean that the total number of parameters, including shared parameters, is much higher?

Great video! In case anybody is wondering how to count the parameters of the Llama models, use the same math as in the video, but keep in mind that Llama has a third projection in its MLP, the 'gate projection', of the same size as the up or down projections.
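A quick sketch of that counting. The GPT-3-style numbers follow the video (d_model = 12,288, hidden size 4x that); the Llama-2-7B sizes (4,096 and 11,008, no biases) are quoted from memory, so treat them as an assumption to double-check.

```python
def mlp_params(d_model: int, d_hidden: int, gated: bool = False, bias: bool = True) -> int:
    up = d_model * d_hidden + (d_hidden if bias else 0)
    down = d_hidden * d_model + (d_model if bias else 0)
    gate = d_model * d_hidden if gated else 0     # Llama's extra gate projection
    return up + down + gate

print(mlp_params(12_288, 4 * 12_288))                     # ~1.2 billion per GPT-3-style block
print(mlp_params(4_096, 11_008, gated=True, bias=False))  # one Llama-2-7B-style MLP block
```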

- Superposition

the superposition chapter is great... Watch it guys n girls

Re: superposition, and the dimensions not being completely independent but rather related. Here's a way to understand superposition that IMO was not really clear in the video.

Where could someone find the source material or "footnotes/bibliography" for each LLM's main base of facts, i.e. the standardized information deemed "valid" by independent, accredited international sources or bodies of information?

Is there a way to estimate the number of additional "dimensions" you get by allowing 89-91 degrees versus exactly 90 degrees?

this part is really cool!

The construction in which all the vectors are nearly orthogonal is illuminating. It suggests that the surface of the n-dimensional sphere is being partitioned into a vast quantity of locally flat "Gaussians" (central limit ;-) of similarity directions. Once you have that, plus the layer depth to discriminate conceptual level, one gets to see how it works, though it doesn't have any explanatory capability, because its vocabulary (numeric vectors) does not bake in the human explanatory phrasings we use (all very "physician, heal thyself" given it's an LLM!).

Can't I make it in JavaScript? ^^

Important correction: There's an error in the scrappy code I was demoing around this point, such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, it appears not to be possible to get 100k vectors in 100d to be as "nearly orthogonal" as this. 100 dimensions seems to be too low, at least for the (89°, 91°) range, for the Johnson-Lindenstrauss lemma to really kick in.
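Not a replacement for the demo code, just a quick empirical check of the underlying point: random unit vectors in a high-dimensional space already sit close to 90° from each other, and the spread of angles tightens as the dimension grows (the sizes here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (100, 1_000, 10_000):
    V = rng.normal(size=(200, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # 200 random unit vectors
    dots = V @ V.T
    off_diag = dots[~np.eye(len(V), dtype=bool)]    # pairwise dot products
    angles = np.degrees(np.arccos(off_diag))
    print(dim, round(angles.min(), 1), round(angles.max(), 1))  # range narrows toward 90°
```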

It's a bit unclear to me how relaxing the vectors' perpendicularity (adding a bit of "noise" to the 90° angles) can create space for additional features. Can somebody help me understand that?

Another way to imagine it is shooting an arrow in space, and shooting a second arrow in a direction 0.001° different. The first inch is nothing, nor is the first 20. But as they go feet and miles out, they'll eventually be so far apart that it's hard to believe they came from the same bow. Also chaotic pendulums, such as a pendulum on the end of a pendulum: slight changes end up producing completely different movement.

More and more it feels as if current networks are mainly our first bookshelves.

"...times as many independent ideas." 💥

Bell Laboratories! I am currently interning at Bell Laboratories :) A fun fact: Yann LeCun created CNNs as part of his internship at Bell Labs.

This reminds me of Bloom filters.

Hey y'all! 🇺🇿

It seems obvious to me that a superposition would store more data, not because of nearly perpendicular vectors, but because you're effectively moving from a unary system to a higher base. It's the same reason you can count to 10, or to 1023, on the same number of fingers.

Skip to that point to understand the issue. Training GPT-4 in 2022 should have taken around a cool thousand years. Then Huang says something silly: "Well, they used a stack of 8,000 H100 GPUs and it only took three months," forgetting that the H100 was only on the drawing board back in 2022 when GPT-4 was trained. Now read a little about the latest discoveries in brain science, especially the N400 and P600, and you tell me how to explain Dan, Rob, Max, and Dennis. I'm gonna leave this up to you, as I'm sure you understand what I'm getting at.

- Up next

Also important in the training process is the concept of self-supervised learning, used to harness the mass of unlabelled data (books, in NLP).

Holograms coming! :D

Uhhh, holograms, I'm so excited, and that is on top of an excellent video. I'm amazed how you manage to consistently keep such a high standard :D
