
intro: Tokenization, GPT-2 paper, tokenization-related issues

*What is tokenization?*
- Tokenization is the process of converting text into a sequence of tokens.
- In large language models it is the key step that turns raw text into the token sequences the model actually processes.
- The quality of the tokenization method directly shapes the model's performance and behavior.

- Tokenization process overview
  - Tokenization is crucial for working with large language models.
  - Tokenization converts text into tokens for language model processing.

How does it know how DefaultCellStyle is spelled? Is there something in the training data that helps create a mapping from that token to the version with spaces? Did OpenAI maybe augment the training data with 'spelling tables'?

*The byte pair encoding algorithm used by GPT-2*
- Byte pair encoding (BPE) is a commonly used tokenization method for building the token vocabulary of a large language model.
- The GPT-2 tokenizer uses BPE to build its vocabulary, where a single token can stand for a combination of several characters.
- BPE handles diverse languages and special characters flexibly, which improves the model's generality and performance.

- Byte pair encoding for tokenization
  - Byte pair encoding is used in state-of-the-art language models.
  - Tokenization generates the vocabularies used for language model input.
  - Tokens are the fundamental units large language models operate on.

tokenization by example in a Web UI (tiktokenizer)

*Tokenization problems in language models*
- Tokenization is essential to a language model's performance and behavior, but it also introduces problems and trade-offs.
- Tokenization quality differs across languages; non-English languages in particular suffer from imbalanced training data.
- The design and implementation of the tokenizer have a real impact on model efficiency and quality, so several factors have to be balanced when optimizing it.

Hey Andrej, thanks for the new video! I'm not yet done but I noticed at you mentioned "notice that the colour is different, so this is not the same token". But actually in that app, the colours are random, and are just cycling through so as not to have twice the same colours in a row. See e.g. the " +" token with different colours, or all the differently coloured spaces in the python code.

For these problems mentioned at around (the word "egg" got tokenized in different ways): would it help if we just lower-cased all the text and used an actual dictionary as token vocabulary?

- Multilingual tokenization challenges
  - Non-English languages face different tokenization challenges.
  - Tokenizers have to handle very different sequence lengths across languages.

@ OFFF Course this legend also speaks Korean! Why wouldn't he? Awesome video Andrej! ❤

omg perfect Korean

Wow, his Korean is so accurate and his accent is incredible. I'm Korean, and this brilliant top-notch human (ASI-level, haha) can do better at anything than me, now even at my mother language, haha ;)

- Tokenization's impact on Python coding
  - Tokenization affects how language models handle code.
  - Tokenizer design influences the model's performance on specific languages.

strings in Python, Unicode code points

"Unicode." I despise Unicode with the passion of a million searing fires. I've written enough code to handle Unicode to feel your pain through the screen without you saying a single word about it. ASCII was v1.0 of character handling. Extended ASCII with "Code Pages" was v1.3. Unicode is barely v2.0 and we still haven't gotten it right. So maybe by v3.0, whatever it ends up being called, we'll _finally_ figure out that human language is too complex to represent in computer systems using a set number of bytes for the representation of a character sequence and finally offer something much more flexible and comprehensive that's also compatible/performant with how computer systems work.

- Unicode encodings for text processing
  - Unicode encodings such as UTF-8 are essential for processing text.
  - Different encodings have varying efficiencies and use cases.
  - UTF-8 is preferred for its compatibility and efficiency.

Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32

*Choosing and comparing character encodings*
- UTF-8 is the dominant encoding on the internet, partly because it is the only one of the Unicode encodings that is backward compatible with ASCII.
- UTF-8 is also more space-efficient than the other encodings for typical text.
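A small Python illustration of the difference between Unicode code points (what `ord()` returns) and the bytes each encoding produces, using a mixed Korean/emoji/English string like the one in the video:

```python
s = "안녕하세요 👋 (hello in Korean!)"

print(ord("안"))                      # 50504, the Unicode code point
print(list("안".encode("utf-8")))     # [236, 149, 136], three bytes under UTF-8

# The same string under the three Unicode encodings; UTF-8 is the most compact here.
for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, len(s.encode(enc)))
```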

*Byte pair encoding in a nutshell*
- The BPE algorithm compresses a sequence by iteratively finding the most frequent pair of adjacent tokens and replacing it with a new token.
- Starting from raw bytes, it builds a compact, fixed-size vocabulary of merges that can encode and decode arbitrary sequences.

- Byte Pair Encoding algorithm overview
  - The BPE algorithm compresses sequences by iteratively finding and merging the most frequent pair of tokens.
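For intuition, here is the classic toy example usually used to introduce BPE, sketched in plain Python:

```python
from collections import Counter

seq = "aaabdaaabac"
pair_counts = Counter(zip(seq, seq[1:]))
print(pair_counts.most_common(1))    # ('a', 'a') occurs 4 times, so it is merged first

# Repeatedly replacing the top pair with a fresh symbol (Z="aa", then Y, then X)
# shrinks the 11-character string to the 5-symbol "XdXac"; the small merge table
# is all that is needed to decode back to the original.
```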

daydreaming: deleting tokenization

I'm at , and I'm wishing the tokenization was getting at the etymological roots of words and/or meaning of marks in pictographic languages.

Byte Pair Encoding (BPE) algorithm walkthrough

starting the implementation

*Implementing byte pair encoding*
- Implement BPE in Python: identify the most common byte pair, replace it, and grow the vocabulary, step by step.
- Apply merges iteratively over the byte sequence until the desired vocabulary size is reached.

- Implementing the Byte Pair Encoding algorithm in Python
  - Encode the text as UTF-8 and convert the bytes to integers for easy manipulation.
  - Identify the most common pair of tokens and replace it with a new token, using small Python functions.

Hey Andrej, great video! However, at , you don't need to convert all the bytes to integers by using map(). When you call list() on tokens, the bytes are by default converted into integers, so just doing 'list(tokens)' is fine instead of 'list(map(int, tokens))'.

At you don't need map(int, ...) because bytes are already enumerable, so just use tokens = list(tokens)

counting consecutive pairs, finding most common pair
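A minimal version of the pair-counting step, in the spirit of the code written in the video:

```python
def get_stats(ids):
    """Count how often each consecutive pair of ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):       # iterate over consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

tokens = list("hello world, hello tokenizer!".encode("utf-8"))
stats = get_stats(tokens)
print(max(stats, key=stats.get))         # the most common consecutive byte pair
```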

merging the most common pair
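And a matching merge step that rewrites the sequence, replacing every occurrence of the chosen pair with a newly minted token id:

```python
def merge(ids, pair, idx):
    """Replace all occurrences of `pair` in `ids` with the new token id `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # if the pair starts at position i, emit the new id and skip both elements
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
```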

I'm jumping in with a comment before finishing the video, but one thing I noticed about this byte-pair encoding implementation is that it is agnostic to the UTF-8 character boundaries. So it should be possible that a token only represents the bytes of half of a multi-byte character. In that case, when trying to visualise which characters are part of which token, like in the tiktokenizer tool you showed at the start, it couldn't really be visualised properly since one character could be split across two tokens. I wonder if this is the case in GPT's encoding or whether there's a case to make sure characters are always grouped into the same token. I'll keep watching... :D

GPT-4 uses ~100,000 tokens, which is not far from the ~150,000 characters that Unicode defines.

training the tokenizer: adding the while loop, compression ratio

- Training and usage of the tokenizer
  - Set the vocabulary size and perform a fixed number of merges to train the tokenizer.
  - The tokenizer is a separate preprocessing stage, distinct from the language model itself.
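A minimal training loop, assuming the get_stats() and merge() sketches above; the corpus file name is just a placeholder:

```python
vocab_size = 276                     # desired final vocabulary size
num_merges = vocab_size - 256        # we start from the 256 raw byte values

text = open("corpus.txt", encoding="utf-8").read()
tokens = list(text.encode("utf-8"))

ids = list(tokens)
merges = {}                          # (int, int) -> new token id
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```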

I'm a total noob, but would there be any benefit instead of taking the whole blog post (around ) and making a .txt file and having the program read it like that as opposed to pasting it as one long line? Just curious if there is pros/cons either way or if it truly doesn't matter
At , in merge, why are we incrementing by 2? Suppose my top pair is (6, 6) and the encoded text is [7, 6, 6, 5, 4, 3]; the code will not be able to replace the (6, 6) with the minted token. Am I missing anything?

Shouldn't it be **num_merges = vocab_size - len(set(tokens))** where **len(set(tokens))** is actually 158 instead of 256?

where would you learn how to code like @?

*Tokenizer training summary*
- The tokenizer is trained completely independently of the large language model.
- It has its own training set and is trained with the BPE algorithm to build the vocabulary.
- Training happens once, up front; afterwards the tokenizer is used only for encoding and decoding.

tokenizer/LLM diagram: it is a completely separate stage

*Tokenizer encoding and decoding*
- The tokenizer is a translation layer between raw text and token sequences.
- It can encode raw text into a token sequence and decode a token sequence back into raw text.
- A large language model's training data is typically preprocessed into token sequences; the model never sees raw text directly.

- Tokenizer training considerations
  - The tokenizer's training set should be diverse, covering many languages and data types (prose, code, etc.).
  - The mix of data determines how densely each kind of text is tokenized, which in turn affects model performance.

*Implementing encode and decode*
- Encoding turns text into a token sequence, applying merges in the order they appear in the merges dictionary.
- Decoding maps a token sequence back to raw text by expanding tokens through the same merges.
- Decoding must handle byte sequences that are not valid UTF-8; the usual fix is to pass an error-handling argument (errors="replace") rather than crash.

decoding tokens to strings

- Decoding ids back into text
  - Build each token's bytes by iterating over the ids and looking them up in the vocab.
  - Concatenate the bytes of all tokens.
  - Decode the resulting bytes back to a string using UTF-8.
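A decoding sketch along those lines, assuming the `merges` dict built during training; note that building `vocab` relies on dict insertion order, which is also why the order in which entries are added matters:

```python
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():         # insertion order guarantees that the
    vocab[idx] = vocab[p0] + vocab[p1]       # two parents are defined before the child

def decode(ids):
    """Map a list of token ids back to a Python string."""
    text_bytes = b"".join(vocab[idx] for idx in ids)
    # errors="replace" guards against id sequences that are not valid UTF-8
    return text_bytes.decode("utf-8", errors="replace")
```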

why at would it matter the order you add the new vocab terms? if you add idx=257 for pair a,b before idx=256 for pair c,d the dictionary is permutation equivariant as a hash table?

Ahh, partially addressed at . However this is fixing error when decoding an invalid UTF-8 sequence. Such errors could be minimised by only tokenizing full UTF-8 sequences, so in this example chr(128) wouldn't be its own token as that's only valid as a UTF-8 continuation byte, not as the first byte of a character.

encoding strings to tokens
I have a question regarding the encoding process . Why not preprocess the keys of the merges dictionary into byte sequences (in the [0–255] range), and then just do a longest prefix match on the input? We may then benefit from a trie-like data structure.

- 𧬠Implementing encoding of string into tokens- Encoding text into UTF-8 to get raw bytes- Performing merges according to lookup dictionary- Identifying pairs for merging and performing merges

I guess next step is to build a vocabulary similar to `decode` and use a trie to encode straight to final tokens?

At , can we not just implement encode by iterating over the merges dictionary (the order is maintained) and calling the merge() function on the tokens? This is what I mean:

    def encode(text):
        tokens = list(text.encode("utf-8"))
        for pair, idx in merges.items():
            tokens = merge(tokens, pair, idx)
        return tokens

I am hugely confused at . Why are we writing such a complicated encoder using a while loop and unintuitive stuff like pair = min(stats, key=lambda p: merges.get(p, float("inf")))? Why can't I just do:

    def encode(self, text):
        tokens = text.encode("utf-8")
        tokens = list(map(int, tokens))
        for pair, index in self.merges.items():
            tokens = merge(tokens, pair, index)

- Perfecting the encoding function and testing
  - Handle the special case of a single character or an empty string.
  - Test that encode and decode round-trip consistently.
  - Validate the implementation on training and validation data.

I think this question is addressed at .

*The tokenizer in the GPT-2 paper*
- The GPT-2 paper describes its tokenizer, which is based on byte pair encoding (BPE).
- It notes that naively letting BPE merge everything produces many redundant, semantically awkward tokens (e.g. "dog.", "dog!", "dog?"), so the authors add manual rules that forbid certain merges.

regex patterns to force splits across categories

*GPT-2 tokenizer implementation details*
- The GPT-2 tokenizer includes a fairly involved regular-expression pattern that dictates which parts of the text must never be merged together.
- It uses Python's `regex` package for the more powerful pattern matching (Unicode categories such as \p{L}).

- Tokenization rules and inconsistencies
  - The apostrophe alternatives ('s, 't, 're, ...) only match lowercase, so uppercase text is handled inconsistently.
  - Punctuation is matched separately so that it never merges with letters or numbers.
  - Whitespace handling matters too, including a negative look-ahead assertion that keeps the last space attached to the following word.
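For reference, this is the split pattern from OpenAI's released GPT-2 encoder.py; it needs the third-party `regex` package for the \p{...} Unicode categories, and BPE merges are then applied within each chunk, never across chunk boundaries:

```python
import regex as re

gpt2pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# splits into chunks of letters, numbers, punctuation and whitespace; note the
# lowercase-only apostrophe alternatives at the front (the inconsistency above)
print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))
```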

"extremely gnarly, and slightly gross" (), is how I feel about ML 99% of the time

*The tiktoken library*
- OpenAI released the tiktoken library, which provides the tokenizer used for GPT-4.
- Unlike GPT-2, the GPT-4 tokenizer merges runs of spaces into a single token.

- GPT-4 tokenizer and the GPT-3.5-turbo scheme
  - The GPT-4 tokenizer uses different merging rules (and a different split pattern) than GPT-2.
  - The GPT-3.5-turbo scheme introduces new special tokens for conversation tracking.
  - Handling special tokens requires additional model adjustments, such as extending the embedding matrix.

tiktoken library intro, differences between GPT-2/GPT-4 regex

*Changes in the GPT-4 tokenizer*
- The GPT-4 tokenizer modifies the GPT-2 recipe: the regex split pattern changed, as did the handling of whitespace and numbers.
- The new pattern matches the apostrophe contractions case-insensitively and caps number chunks at three digits, to avoid creating very long numeric tokens.
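A quick way to see the GPT-2 vs GPT-4 differences yourself is the tiktoken library (the exact token ids depend on the vocabularies, so they are not hard-coded here):

```python
import tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")
enc_gpt4 = tiktoken.get_encoding("cl100k_base")   # the GPT-4 vocabulary

s = "    hello world 1234567"
print(enc_gpt2.encode(s))   # GPT-2: leading spaces tend to stay as separate tokens
print(enc_gpt4.encode(s))   # GPT-4: runs of whitespace merge; digits chunked at most 3 at a time
```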

I'm guessing they limit the numerical tokens to a length of 3 because otherwise they would blow out the size of the vocabulary trying to store the various combinations of numbers, or am I off base on that?

The reason they only match up to 3 digits is quite simple: 1000000 is normally written as 1,000,000, so at most 3 digits per segment are necessary. Applying the pattern segments the number string into "1" - "," - "000" - "," - "000".

GPT-2 encoder.py released by OpenAI walkthrough

Our variable naming was really good ()

*Tokenizer algorithm principles*
- The algorithm developed here is essentially the same as OpenAI's implementation.
- Once you understand the principle, you can build, train, and use a tokenizer yourself.
- OpenAI's implementation adds some fairly unimportant details, but the core algorithm is the same.

I think the reason for the byte encode/decode is to make sure no control codes are stored in the file, since it's being read as text. E.g. 0xA and 0xD are newline characters and those could mess up the file. That said, I haven't looked at the BPE file, just the merges file for CLIP, so it can be different for Open AI.

special tokens, tiktoken handling of, GPT-2/GPT-4 differences

*Uses and handling of special tokens*
- Special tokens mark special structure in the data or delimit separate parts of it.
- Adding special tokens requires some surgery on the model, including extending the embedding matrix and the final projection layer.
- This is especially common in fine-tuning, for example when turning a base language model into a chat model.

- Special tokens and fine-tuning
  - Special tokens, like "End of Text" (<|endoftext|>), delimit documents in the GPT training set.
  - Adding special tokens requires model adjustments such as extending the embedding matrix.
  - Special tokens are crucial for tasks like fine-tuning a base model into a chatbot model.
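In tiktoken, special tokens are never produced by the ordinary BPE path; you have to allow them explicitly, which also protects against user text injecting them. A small sketch (100257 is the <|endoftext|> id in cl100k_base):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [100257]
# enc.encode("<|endoftext|>")   # without allowed_special this raises an error, by design
```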

oh my, the realization of the year 🔥🔥🔥🔥

what is it short for at ?

minbpe exercise time! write your own GPT-4 tokenizer

Q: What is Andrej's favorite programming language? A: Swift

The moment when you realise there is more to life than research.

- Tokenization using SentencePiece
  - SentencePiece is widely used in language models because it supports both training and inference efficiently.

sentencepiece library intro, used to train Llama 2 vocabulary

*SentencePiece compared with tiktoken*
- SentencePiece is another commonly used tokenization library, and it supports both training and inference.
- It tokenizes differently: BPE runs directly on Unicode code points, with a byte fallback for rare code points.
- SentencePiece exposes a large number of configuration options, which usually need careful tuning for a particular task.

- Configuration and training with SentencePiece
  - SentencePiece has numerous configuration options, many of them historical baggage.
  - Training involves defining input/output files, choosing the algorithm, and setting preprocessing/normalization rules.

*How SentencePiece works and its parameter settings*
- SentencePiece applies a pipeline of normalization and splitting rules before encoding; its built-in notion of "sentences" sits awkwardly next to the LLM view of training data as one long stream of text.
- Training requires specifying the special tokens (UNK, BOS, EOS, PAD), and the UNK token must exist.
- An example walks through the resulting vocabulary and the encoding process, including how unknown characters are handled via byte fallback.
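A small training run with the sentencepiece Python bindings, roughly in the byte-fallback BPE style discussed here; the corpus file name and vocabulary size are placeholders:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",            # plain-text training file, one sentence per line
    model_prefix="tok400",             # writes tok400.model and tok400.vocab
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,                # characters outside the vocab fall back to raw bytes
    character_coverage=0.99995,
    unk_id=0, bos_id=1, eos_id=2, pad_id=-1,   # UNK must exist; PAD disabled here
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))   # rare characters show up as <0x..> byte pieces
```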

how to set vocabulary set? revisiting gpt.py transformer

*Vocabulary size in the Transformer*
- The vocabulary size determines the size of the token-embedding table and the number of parameters in the LM head.
- Increasing it adds compute, spreads the training signal more thinly across rare tokens, and squeezes more text into fewer tokens per sequence.
- Choosing the vocabulary size is an empirical hyperparameter decision, usually in the tens of thousands up to around 100k, depending on the application and the available compute.

- Vocab size and model architecture
  - Vocabulary size impacts model training and computational complexity.
  - Very large vocab sizes can lead to undertrained rare tokens and too much information squeezed into single tokens.

- Extending vocab size in pre-trained models
  - A pre-trained model's vocabulary can be extended by adding new tokens.
  - This involves resizing the embedding table and the final linear layer so the new tokens get embeddings and output probabilities.
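A PyTorch-flavoured sketch of that surgery; the sizes are illustrative, and in practice the old weights would come from the actual pretrained checkpoint:

```python
import torch
import torch.nn as nn

d_model, old_vocab, new_vocab = 768, 50257, 50260   # e.g. adding 3 special tokens

# stand-ins for the pretrained weights
old_emb  = nn.Embedding(old_vocab, d_model)
old_head = nn.Linear(d_model, old_vocab, bias=False)

# larger tables: copy the trained rows, leave the new rows freshly initialised
new_emb  = nn.Embedding(new_vocab, d_model)
new_head = nn.Linear(d_model, new_vocab, bias=False)
with torch.no_grad():
    new_emb.weight[:old_vocab]  = old_emb.weight
    new_head.weight[:old_vocab] = old_head.weight

# To train only the new rows, a common trick is to zero the gradients of the old
# rows after each backward pass (or keep the new rows in a separate parameter),
# so the pretrained weights stay frozen.
```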

training new tokens, example of prompt compression

*Extending the vocabulary and applying Transformers to multimodal data*
- The vocabulary can be extended with fairly simple model surgery; one approach freezes the base model and trains only the new parameters.
- For multimodal data, inputs from other domains can be converted into tokens and processed by the same Transformer.
- Both academia and industry are exploring ways to tokenize other modalities, with a variety of novel methods and techniques.

- Fine-tuning techniques around new tokens
  - Training new tokens with a distillation technique (e.g. compressing long prompts into a few "gist" tokens).
  - Optimizing over the new tokens without changing the model architecture.
  - Fine-tuning stays efficient because only the new token embeddings are trained.
multimodal [image, video, audio] tokenization with vector quantization

- Processing multimodal inputs
  - Transformers can be adapted to process other modalities such as images, video, and audio.
  - Each input domain is tokenized into discrete tokens.
  - The same Transformer architecture then works across the different input types.

revisiting and explaining the quirks of LLM tokenization

- Tokenization quirks revisited
  - Language models struggle with spelling and simple arithmetic largely because of tokenization.
  - English and non-English text tokenize very differently.
  - Tokenization also shapes how well models handle Python code.

*How tokenization affects specific tasks*
- Long tokens can make the model perform poorly on character-level tasks such as spelling checks or reversing a string.
- Non-English text and simple arithmetic also suffer from how they are tokenized, degrading performance.
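You can see the root cause directly with tiktoken: common words collapse into single opaque tokens, so the model never "sees" their characters. The outputs depend on the vocabulary, so they are not hard-coded here:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["egg", " egg", "Egg", " EGG", ".DefaultCellStyle"]:
    ids = enc.encode(s)
    print(repr(s), len(ids), [enc.decode([t]) for t in ids])
```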

in GPT-4 whatever you put inside "<|" and "|>" behaves the same. E.g., "<|a|>"

*Unexpected model behavior around special strings*
- The model can behave unexpectedly when special token strings appear in the input, e.g. halting generation or producing meaningless output.
- Handling of special characters can be a vulnerability and a potential attack surface.

My guess is that special tokens are just directly cut from the user provided string.

*Trailing whitespace and model performance*
- A trailing space in the prompt can hurt the model, making outputs unstable or less accurate.
- The trailing space pushes the input off the distribution the model was trained on (the space is usually part of the next word's token), so results become inconsistent.

"Feel the agi" ð "Feel the jank" ð

*Mismatch between tokenizer training data and model training data*
- When special strings that became tokens in the tokenizer's training data barely appear in the model's training data, the model behaves abnormally when it encounters them.
- Such never-trained tokens are effectively undefined at inference time, producing bizarre outputs or behavior.

*Different formats and languages affect the GPT tokenizer*
- Different data formats and languages can affect the tokenizer's efficiency and, with it, model performance.
- For example, JSON tokenizes comparatively poorly, which makes it more expensive to process.

- Tokenization efficiency considerations
  - Different data formats and representations can substantially change how many tokens the same content costs.

*Data formats and token efficiency*
- YAML is more token-efficient than JSON for the same structured data, producing noticeably fewer tokens.
- When you pay per token and work with structured data, choosing a denser format saves cost and context length.

- The importance of measuring token efficiency
  - Token density is crucial for cost-effective processing of data.
  - It is worth spending time measuring token efficiency across formats before committing to one.
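A rough way to measure this for your own data; this assumes PyYAML is installed, and the exact counts depend entirely on the data:

```python
import json
import yaml          # PyYAML
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
data = {"products": [{"name": f"widget-{i}", "price": 9.99, "in_stock": True}
                     for i in range(20)]}

as_json = json.dumps(data, indent=2)
as_yaml = yaml.dump(data)
print("json tokens:", len(enc.encode(as_json)))
print("yaml tokens:", len(enc.encode(as_yaml)))   # typically the smaller of the two
```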

final recommendations

*Take tokenization seriously*
- The tokenization stage can harbor security issues and AI-safety issues, and it deserves real attention.
- Tokenization is an annoying stage, but its importance should not be ignored; further research and improvements are worth looking forward to.

- Recommendations for applying tokenization
  - If you can, reuse the GPT-4 tokens and vocabulary in your application.
  - For inference, consider a library such as tiktoken.

*Recommended tools*
- For applications: if you can reuse the GPT-4 tokens and vocabulary, tiktoken is an efficient library for inference.
- If you train your own vocabulary, prefer byte-level BPE in the style used by tiktoken and OpenAI.

??? :)
