Let's build the GPT Tokenizer

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
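
As a rough illustration of that interface, here is a minimal sketch loosely following the structure of minbpe (names and signatures are illustrative, not the exact library API):

    # Minimal sketch of the tokenizer interface described above (illustrative only).
    class Tokenizer:
        def train(self, text, vocab_size):
            """Learn a vocabulary of vocab_size tokens from text using Byte Pair Encoding."""
            raise NotImplementedError

        def encode(self, text):
            """string -> list of token ids"""
            raise NotImplementedError

        def decode(self, ids):
            """list of token ids -> string"""
            raise NotImplementedError
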

Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)

Exercises:
- Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions if you're getting stuck are in the minbpe code https://github.com/karpathy/minbpe/blob/master/exercise.md

Links:
- Google colab for the video: https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
- GitHub repo for the video: minBPE https://github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
- our Discord channel: https://discord.gg/3zy8kqD9Cp
- my Twitter:

Supplementary links:
- tiktokenizer https://tiktokenizer.vercel.app
- tiktoken from OpenAI: https://github.com/openai/tiktoken
- sentencepiece from Google https://github.com/google/sentencepiece

Timetable (chapters and viewer comments):

00:00:00 - 00:05:50  intro: Tokenization, GPT-2 paper, tokenization-related issues
@Ahwu_AIClass (00:00:00 - 00:02:43): 🤖 What is tokenization? Tokenization is the process of converting text into a sequence of tokens. In large language models it is the key step that turns text into the token sequences the model processes, and its quality and method directly affect the model's performance and behavior.
@Gaurav-pq2ug (00:00:00 - 00:03:13): 🧩 Tokenization process overview: tokenization is crucial for working with large language models; it converts text into tokens for language-model processing.
@karlkastor (00:01:53): How does it know how DefaultCellStyle is spelled? Is there something in the training data that helps create a mapping from that token to the version with spaces? Did OpenAI maybe augment the training data with 'spelling tables'?
@Ahwu_AIClass (00:02:43 - 00:05:59): 🔍 The byte pair encoding algorithm used by GPT-2: byte pair encoding is a common tokenization method for building a large language model's token vocabulary. The GPT-2 tokenizer uses it to construct its vocabulary, where each token can be a combination of several characters. The algorithm handles many languages and special characters flexibly, which improves the model's applicability and performance.
@Gaurav-pq2ug (00:03:13 - 00:09:17): 🍬 Byte-pair encoding for tokenization: byte-pair encoding is used in state-of-the-art language models; tokenization generates the vocabulary for language-model input; tokens are the fundamental units of large language models.

00:05:50 - 00:14:56  tokenization by example in a Web UI (tiktokenizer)
@Ahwu_AIClass (00:05:59 - 00:19:25): 🌐 Tokenization problems in language models: tokenization is critical to a model's performance and behavior, but it also brings problems and challenges. Different languages tokenize with different quality; non-English languages in particular can suffer from data imbalance. The design and implementation of the tokenizer strongly affect model efficiency and quality, so optimizing it means weighing several factors at once.
@松松-l9w (00:08:46): Hey Andrej, thanks for the new video! I'm not yet done, but I noticed that at this timestamp you mentioned "notice that the colour is different, so this is not the same token". Actually, in that app the colours are random and simply cycle so as not to show the same colour twice in a row. See e.g. the " +" token with different colours, or all the differently coloured spaces in the Python code.
@sunnyvalecaliforia (00:08:55): For these problems mentioned around this point (the word "egg" got tokenized in different ways): would it help if we just lower-cased all the text and used an actual dictionary as the token vocabulary?
@Gaurav-pq2ug (00:09:17 - 00:14:47): 🌏 Multilingual tokenization challenges: non-English languages may pose different tokenization challenges, and tokenizers have to handle varying sequence lengths across languages.
@reza2kn (00:09:38): OFFF course this legend also speaks Korean! Why wouldn't he? Awesome video Andrej! ❤
@KwangrokRyoo (00:09:38): omg perfect Korean
@bayesianlee6447 (00:09:39): Wow, his Korean is so accurate and the accent is incredible. I'm Korean, and this brilliant top-notch human (ASI level, haha) can do anything better than me, now even my mother language, haha ;)
@Gaurav-pq2ug (00:14:47 - 00:18:13): 🐍 Tokenization's impact on Python coding: tokenization affects how code is handled by language models, and tokenizer design influences a model's performance on specific languages.

00:14:56 - 00:18:15  strings in Python, Unicode code points
@privacyvalued4134 (00:16:00): "Unicode." I despise Unicode with the passion of a million searing fires. I've written enough code to handle Unicode to feel your pain through the screen without you saying a single word about it. ASCII was v1.0 of character handling. Extended ASCII with "Code Pages" was v1.3. Unicode is barely v2.0 and we still haven't gotten it right. So maybe by v3.0, whatever it ends up being called, we'll finally figure out that human language is too complex to represent in computer systems with a fixed number of bytes per character sequence, and finally offer something much more flexible and comprehensive that's also compatible and performant with how computer systems work.
@Gaurav-pq2ug (00:18:13 - 00:22:26): 🔠 Unicode encodings for text processing: Unicode encodings like UTF-8 are essential for processing text; different encodings have varying efficiencies and use cases; UTF-8 is preferred for its compatibility and efficiency.

00:18:15 - 00:22:47  Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
@Ahwu_AIClass (00:19:25 - 00:22:01): 🧮 Choosing and comparing character encodings: UTF-8 is widely adopted on the internet because it is the only one of these encodings that is backward compatible with ASCII, and it also encodes typical text more compactly than the alternatives.
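
To make the code point / byte encoding distinction concrete, a small Python illustration (the example string is arbitrary; any short text works):

    s = "hello 안녕 👋"
    print([ord(c) for c in s])        # Unicode code points of each character
    print(list(s.encode("utf-8")))    # UTF-8: variable-length, 1-4 bytes per code point
    print(len(s.encode("utf-8")), len(s.encode("utf-16")), len(s.encode("utf-32")))
    # For mostly-ASCII text, UTF-8 is the most compact of the three encodings.
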
@Ahwu_AIClass (00:22:01 - 00:27:10): 🧩 Introduction to the byte pair encoding algorithm: BPE compresses a text sequence by iteratively identifying the most frequent pair of bytes and replacing it with a new token. The algorithm compresses the raw byte sequence using a fixed-size vocabulary and supports encoding and decoding of arbitrary sequences.
@Gaurav-pq2ug (00:22:26 - 00:27:10): 🧠 Byte Pair Encoding algorithm overview: BPE compresses sequences by iteratively finding and merging the most frequent pairs of tokens.

00:22:47 - 00:23:50  daydreaming: deleting tokenization
@kurtesimo (00:23:30): I'm at this point, and I'm wishing the tokenization was getting at the etymological roots of words and/or the meaning of marks in pictographic languages.

00:23:50 - 00:27:02  Byte Pair Encoding (BPE) algorithm walkthrough
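
The idea in miniature, on the classic toy string from the Wikipedia BPE article (a hedged sketch; "Z", "Y", "X" simply stand in for newly minted tokens):

    s = "aaabdaaabac"
    s = s.replace("aa", "Z")   # most frequent pair "aa" -> new token Z; "ZabdZabac"
    s = s.replace("ab", "Y")   # next most frequent pair "ab" -> Y; "ZYdZYac"
    s = s.replace("ZY", "X")   # tokens from earlier merges can merge again; "XdXac"
    print(s)                   # XdXac, with the merge table Z="aa", Y="ab", X="ZY"
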

00:27:02 - 00:28:35  starting the implementation
@Ahwu_AIClass (00:27:10 - 00:38:01): 🖥️ Implementing the byte pair encoding algorithm in Python: identify the most common byte pair, replace it, and build up the new vocabulary, merging the text sequence iteratively until the desired vocabulary size is reached.
@Gaurav-pq2ug (00:27:10 - 00:35:00): 📊 Implementing the Byte Pair Encoding algorithm in Python: encoding text into UTF-8 bytes and converting them to integers for manipulation; identifying the most common pair of tokens and replacing it with a new token using Python functions.
@prateekvellala (00:27:23): Hey Andrej, great video! However, at this point you don't need to convert all the bytes to integers by using map(). When you call list() on tokens, the bytes are by default converted into integers, so just doing 'list(tokens)' is fine instead of 'list(map(int, tokens))'.
@unperrier (00:27:24): At this point you don't need map(int, ...) because bytes are already enumerable, so just use tokens = list(tokens)
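
A quick check of that point (a bytes object already yields ints when iterated):

    raw = "hi".encode("utf-8")     # b'hi'
    print(list(raw))               # [104, 105] -- same result as list(map(int, raw))
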

00:28:35 - 00:30:36  counting consecutive pairs, finding most common pair
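
A minimal sketch of this step, in the spirit of the video's get_stats helper (the function name follows minbpe; treat the details as illustrative):

    def get_stats(ids):
        """Count how often each consecutive pair of token ids occurs."""
        counts = {}
        for pair in zip(ids, ids[1:]):      # iterate consecutive elements
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    ids = list("aaabdaaabac".encode("utf-8"))
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)    # most common pair, here (97, 97) i.e. "aa"
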

00:30:36 - 00:34:58  merging the most common pair
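
And a sketch of the corresponding merge step (mirroring the video's merge function; details illustrative):

    def merge(ids, pair, idx):
        """Replace every occurrence of pair in ids with the new token id idx."""
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
                out.append(idx)   # consume both elements of the matched pair
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
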
@ashh3051 (00:31:03): I'm jumping in with a comment before finishing the video, but one thing I noticed about this byte-pair encoding implementation is that it is agnostic to UTF-8 character boundaries. So it should be possible that a token only represents the bytes of half of a multi-byte character. In that case, when trying to visualise which characters are part of which token, like in the tiktokenizer tool shown at the start, it couldn't really be visualised properly, since one character could be split across two tokens. I wonder if this is the case in GPT's encoding or whether there's a check to make sure characters are always grouped into the same token. I'll keep watching... :D
@dr.emmettbrown7183 (00:34:47): GPT-4 uses 100,000 tokens, which is not far from the roughly 150,000 characters that Unicode defines.

00:34:58 - 00:39:20  training the tokenizer: adding the while loop, compression ratio
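
A compact sketch of the training loop this chapter builds (it assumes the get_stats and merge helpers sketched above; the file name and vocab size are placeholders, not the video's exact text):

    text = open("some_training_text.txt", encoding="utf-8").read()  # hypothetical file
    ids = list(text.encode("utf-8"))
    vocab_size = 276                      # e.g. 256 raw bytes + 20 merges
    num_merges = vocab_size - 256
    merges = {}                           # (int, int) -> int
    for i in range(num_merges):
        stats = get_stats(ids)
        pair = max(stats, key=stats.get)  # most frequent pair right now
        idx = 256 + i                     # mint a new token id
        ids = merge(ids, pair, idx)
        merges[pair] = idx

    print(f"compression ratio: {len(text.encode('utf-8')) / len(ids):.2f}X")
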
@Gaurav-pq2ug (00:35:00 - 00:41:30): 🧭 Training and usage of the tokenizer: setting the vocabulary size and performing a fixed number of merges to create the tokenizer; discussing the role of the tokenizer as a separate preprocessing stage from the language model.
@brendancarnill9625 (00:35:31): I'm a total noob, but would there be any benefit to taking the whole blog post (around this point), making it a .txt file and having the program read it that way, as opposed to pasting it in as one long line? Just curious if there are pros/cons either way or if it truly doesn't matter.
@1tahirrauf (00:35:43): At this point, in merge, why are we incrementing by 2? Suppose my top pair is (6, 6) and the encoded text is [7, 6, 6, 5, 4, 3]; won't the code fail to replace the (6, 6) with the newly minted token? Am I missing anything?
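
(Using the merge sketch above: the index only advances by 2 when a pair actually matches at the current position, otherwise it advances by 1, so that case is handled:)

    print(merge([7, 6, 6, 5, 4, 3], (6, 6), 256))   # [7, 256, 5, 4, 3]
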
@koza1169 (00:36:00): Shouldn't it be **num_merges = vocab_size - len(set(tokens))**, where **len(set(tokens))** is actually 158 instead of 256?
@hellomyfriend_S2 (00:37:37): Where would you learn how to code like this?
@Ahwu_AIClass (00:38:01 - 00:39:38): 📊 Tokenizer training summary: the tokenizer's training is completely independent of the large language model. The tokenizer has its own training set and is trained with the BPE algorithm to build a vocabulary. Training is done once, and afterwards the tokenizer can be used for encoding and decoding.

00:39:20 - 00:42:47  tokenizer/LLM diagram: it is a completely separate stage
@Ahwu_AIClass (00:39:38 - 00:42:41): 🔤 Tokenizer encoding and decoding: the tokenizer is a translation layer between raw text and token sequences. It can encode raw text into a token sequence and decode a token sequence back into raw text. An LLM's training data is typically preprocessed into token sequences rather than fed in as raw text.
@Gaurav-pq2ug (00:41:30 - 00:44:15): 🌐 Tokenizer training considerations: highlighting the importance of diverse tokenizer training sets that span various languages and data types, and explaining how different data representations affect token sequence density and model performance.
@Ahwu_AIClass (00:42:41 - 00:57:24): 🛠️ Implementing encode and decode: encoding converts text into a token sequence, applying merges in the order recorded in the merges dictionary; decoding converts a token sequence back into raw text, again driven by the merges dictionary. When decoding you must handle byte sequences that are not valid UTF-8; the common practice is to pass an error-handling argument to avoid hard failures.

00:42:47 - 00:48:21  decoding tokens to strings
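
A minimal decode sketch in the spirit of this chapter (it assumes the merges dict from the training sketch above; errors="replace" mirrors the handling of invalid UTF-8 discussed in the video):

    # Build the vocab: raw bytes first, then one entry per learned merge.
    vocab = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in merges.items():          # dicts preserve insertion order
        vocab[idx] = vocab[p0] + vocab[p1]

    def decode(ids):
        """list of token ids -> Python string"""
        raw = b"".join(vocab[idx] for idx in ids)
        return raw.decode("utf-8", errors="replace")  # tolerate invalid byte sequences
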
@Gaurav-pq2ug (00:44:15 - 00:48:33): 🧮 Turning ids back into text: iterate over the ids, look up each one's bytes in the vocab, concatenate the bytes, and decode them back to a string as UTF-8.
@kkyars (00:44:20): Why, at this point, would it matter in what order you add the new vocab terms? If you add idx=257 for pair (a, b) before idx=256 for pair (c, d), isn't the dictionary permutation equivariant as a hash table?
@ashh3051 (00:45:52): Ahh, partially addressed here. However, this is fixing the error when decoding an invalid UTF-8 sequence. Such errors could be minimised by only tokenizing full UTF-8 sequences, so in this example chr(128) wouldn't be its own token, as that's only valid as a UTF-8 continuation byte, not as the first byte of a character.

00:48:21 - 00:57:36  encoding strings to tokens
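
A sketch of the encode direction discussed here (it assumes the helpers and merges dict sketched above; the min-over-merge-index trick mirrors the approach taken in the video):

    def encode(text):
        """Python string -> list of token ids"""
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = get_stats(ids)
            # pick the pair that was merged earliest during training
            pair = min(stats, key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break                      # nothing left to merge
            ids = merge(ids, pair, merges[pair])
        return ids

    print(decode(encode("hello world")) == "hello world")  # round-trip check
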
@mohamedmonsef3471 (00:48:22): I have a question regarding the encoding process. Why not preprocess the keys of the merges dictionary into byte sequences (in the [0-255] range) and then just do a longest-prefix match on the input? We could then benefit from a trie-like data structure.
@Gaurav-pq2ug (00:48:33 - 00:55:16): 🧬 Implementing encoding of a string into tokens: encode the text as UTF-8 to get the raw bytes, then perform merges according to the lookup dictionary, repeatedly identifying pairs to merge and merging them.
@Rizhiy13 (00:54:20): I guess the next step is to build a vocabulary similar to `decode` and use a trie to encode straight to the final tokens?
@azizshameem6241 (00:54:55): At this point, can we not just implement encode by iterating over the merges dictionary (the order is maintained) and calling the merge() function on the tokens? This is what I mean:
    def encode(text):
        tokens = list(text.encode("utf-8"))
        for pair, idx in merges.items():
            tokens = merge(tokens, pair, idx)
        return tokens
@jackxiao8140 (00:55:10): I am hugely confused at this point. Why are we writing such a complicated encoder using a while loop and unintuitive stuff like pair = min(stats, key=lambda p: merges.get(p, float("inf")))? Why can't I just do:
    def encode(self, text):
        tokens = text.encode("utf-8")
        tokens = list(map(int, tokens))
        for pair, index in self.merges.items():
            tokens = merge(tokens, pair, index)
@Gaurav-pq2ug (00:55:16 - 01:06:31): 📝 Perfecting the encoding function and testing: addressing the special case of a single character or empty string, testing that encoding and decoding round-trip consistently, and validating the implementation on training and validation data.
@TheFrankyguitar (00:56:12): I think this question is addressed at this point.
@Ahwu_AIClass (00:57:24 - 00:59:29): 🧩 The tokenizer in the GPT-2 paper: the paper explains the tokenizer GPT-2 uses, which is based on byte pair encoding (BPE). It notes that naively applying BPE merges to common words produces messy token semantics, so it introduces manually specified rules restricting which merges are allowed.

00:57:36 - 01:11:38  regex patterns to force splits across categories
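
For reference, the split pattern from OpenAI's GPT-2 encoder.py, which this chapter walks through, can be tried with the third-party `regex` module (the example string is arbitrary):

    import regex as re   # the `regex` package, needed for the \p{...} classes

    gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
    print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))
    # Text is first chopped into these chunks; BPE merges then happen within each
    # chunk, so merges never cross letter/number/punctuation/space boundaries.
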
@Ahwu_AIClass (00:59:29 - 01:11:08): 🛠️ GPT-2 tokenizer implementation details: the GPT-2 tokenizer includes a complex regular-expression pattern that dictates which parts of the text should never be merged together, and it uses Python's regex package for more powerful pattern matching.
@Gaurav-pq2ug (01:06:31 - 01:11:08): 🧩 Tokenization rules and inconsistencies: the apostrophe rules behave inconsistently for uppercase versus lowercase letters; matching punctuation characters is essential to keep them separate from letters and numbers; and understanding the whitespace handling, including the negative look-ahead assertion, is crucial.
@naromsky (01:07:20): "Extremely gnarly, and slightly gross" is how I feel about ML 99% of the time.
@Ahwu_AIClass (01:11:08 - 01:13:12): 🧰 Introduction to the tiktoken library: OpenAI released the tiktoken library, used for GPT-4's tokenization. Unlike GPT-2, the GPT-4 tokenizer merges runs of spaces into a single token, which GPT-2 does not.
@Gaurav-pq2ug (01:11:08 - 01:18:32): 🤖 GPT-4 tokenizer and the GPT-3.5 Turbo scheme: the GPT-4 tokenizer uses different merging rules compared to GPT-2; the GPT-3.5 Turbo chat scheme introduces new special tokens for conversation tracking; and handling special tokens requires additional model adjustments, such as extending the embedding matrix.

01:11:38 - 01:14:59  tiktoken library intro, differences between GPT-2/GPT-4 regex
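
A minimal usage sketch of the tiktoken library, using the published "gpt2" and "cl100k_base" encodings (the test string is arbitrary; token ids elided):

    import tiktoken

    enc_gpt2 = tiktoken.get_encoding("gpt2")          # GPT-2 tokenizer
    enc_gpt4 = tiktoken.get_encoding("cl100k_base")   # GPT-4 tokenizer

    text = "    hello world!!!"
    print(enc_gpt2.encode(text))   # each leading space tends to be its own token
    print(enc_gpt4.encode(text))   # GPT-4's regex lets runs of whitespace merge
    print(enc_gpt4.decode(enc_gpt4.encode(text)) == text)
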
@Ahwu_AIClass (01:13:12 - 01:16:40): 🔍 Changes in the GPT-4 tokenizer: compared with GPT-2, the GPT-4 tokenizer makes several modifications, including changes to the regex pattern and to how whitespace and numbers are handled. The pattern adds case-insensitive matching and limits how many digits can be merged together, to avoid overly long numeric tokens.
@K9Megahertz (01:14:20): I'm guessing they limit the numerical tokens to a length of 3 because otherwise they would blow out the size of the vocabulary trying to store the various combinations of numbers, or am I off base on that?
@antoniosapostolou2907 (01:14:20): The reason they only match up to 3 digits is quite simple: 1000000 is normally written as 1,000,000, so only up to 3 digits per segment are necessary. Applying the pattern segments the number string into "1" - "," - "000" - "," - "000".

01:14:59 - 01:18:26  GPT-2 encoder.py released by OpenAI walkthrough
@waytolegacy (01:16:20): Our variable naming was really good.
@Ahwu_AIClass (01:16:40 - 01:18:32): 🤖 How the tokenizer algorithm works: the algorithm developed here is essentially the same as OpenAI's implementation. Once you understand the principle, you can build, train and use a tokenizer; OpenAI's code adds some less important details, but the fundamentals are identical.
@phizc (01:17:00): I think the reason for the byte encode/decode is to make sure no control codes are stored in the file, since it's being read as text. E.g. 0xA and 0xD are newline characters, and those could mess up the file. That said, I haven't looked at the BPE file, just the merges file for CLIP, so it can be different for OpenAI.

01:18:26 - 01:25:28  special tokens, tiktoken handling of, GPT-2/GPT-4 differences
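
A small sketch of how tiktoken treats special tokens (the <|endoftext|> token exists in both the GPT-2 and GPT-4 vocabularies; by default encode() refuses special tokens found in user text unless they are explicitly allowed):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "<|endoftext|> hello"

    print(enc.encode(text, allowed_special={"<|endoftext|>"}))  # special token kept as one id
    print(enc.encode(text, disallowed_special=()))              # treated as ordinary text instead
    # enc.encode(text) with the defaults raises an error, as a safety measure against
    # special tokens sneaking in from user-supplied strings.
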
@Ahwu_AIClass (01:18:32 - 01:28:55): 🛠️ Uses and handling of special tokens: special tokens mark special structure in the data or delimit its different parts. Adding them requires some surgery on the model, including extending the embedding matrix and the final-layer projection. This is especially common in fine-tuning, for example when turning a base language model into a chat model.
@Gaurav-pq2ug (01:18:32 - 01:28:41): 🏷 Special tokens and fine-tuning: special tokens like the "End of Text" token delimit documents in the GPT training set; adding special tokens requires model adjustments such as extending the embedding matrix; and special tokens are crucial for tasks like fine-tuning a base model into a chatbot model.
@waytolegacy (01:19:34): oh my, the realization of the year 🔥🔥🔥🔥
@kkyars (01:22:40): What is it short for at this point?

01:25:28 - 01:28:42  minbpe exercise time! write your own GPT-4 tokenizer
@abhishekraok (01:27:50): Q: What is Andrej's favorite programming language? A: Swift 😁
@coralexbadea (01:27:50): The moment when you realise there is more to life than research. 😅😂
@Gaurav-pq2ug (01:28:41 - 01:31:23): 🧠 Tokenization using SentencePiece: SentencePiece is widely used in language models; it supports both training and inference efficiently.

01:28:42 - 01:43:27  sentencepiece library intro, used to train Llama 2 vocabulary
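
A hedged sketch of training and using a SentencePiece BPE model, roughly along the lines shown in the video (the input file name, vocab size, and options here are illustrative, not the exact Llama 2 settings):

    import sentencepiece as spm

    # train: reads a raw text file and writes tok400.model / tok400.vocab
    spm.SentencePieceTrainer.train(
        input="toy.txt",            # hypothetical training text file
        model_prefix="tok400",
        model_type="bpe",
        vocab_size=400,
        byte_fallback=True,         # rare/unknown code points fall back to raw bytes
        character_coverage=0.99995,
    )

    # inference: encode and decode with the trained model
    sp = spm.SentencePieceProcessor(model_file="tok400.model")
    ids = sp.encode("hello 안녕하세요")
    print(ids)
    print(sp.decode(ids))
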
@Ahwu_AIClass (01:28:55 - 01:34:08): 🧩 Comparing SentencePiece with our tokenizer: SentencePiece is another commonly used tokenization library and supports both training and inference. It tokenizes differently, running BPE directly on Unicode code points and using a fallback mechanism for rare code points. It exposes a large number of configuration options, which usually have to be adjusted for the specific NLP task.
@Gaurav-pq2ug (01:31:23 - 01:43:31): 📜 Configuration and training with SentencePiece: SentencePiece has numerous configuration options, with plenty of historical baggage; the training process involves defining input/output files, selecting the algorithm, and setting preprocessing rules.
@Ahwu_AIClass (01:34:08 - 01:43:31): 🧩 How SentencePiece works and how its parameters are set: SentencePiece treats the text file as a byte stream rather than as sentences and applies a series of rules for segmentation and encoding. Training requires specifying special tokens such as UNK, BOS, EOS and PAD, and the UNK token must exist. The examples walk through SentencePiece's vocabulary and encoding process, including how unknown characters and byte fallback are handled.

01:43:27 - 01:48:11  how to set vocabulary set? revisiting gpt.py transformer
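
Where vocab_size shows up in the transformer revisited here, in miniature (a PyTorch sketch; the numbers are illustrative GPT-2-like sizes, not the video's exact values):

    import torch.nn as nn

    vocab_size, n_embd = 50257, 768   # illustrative sizes

    # the two places the vocabulary size appears in the model:
    token_embedding_table = nn.Embedding(vocab_size, n_embd)  # token id -> embedding vector
    lm_head = nn.Linear(n_embd, vocab_size)                   # hidden state -> logits over tokens

    # growing the vocabulary grows both of these, adds compute in the final softmax,
    # and means each individual token is seen (and trained) less often.
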
@Ahwu_AIClass (01:43:31 - 01:48:11): 🔍 Understanding vocabulary size in the Transformer: the vocab size sets the size of the token embedding table and the parameter count of the LM head. Increasing it increases computation, makes individual token parameters more sparsely trained, and shrinks sequence lengths. Choosing it is an empirical hyperparameter decision, typically in the tens of thousands up to around a hundred thousand, depending on the application and available compute.
@Gaurav-pq2ug (01:43:31 - 01:47:02): 🤖 Vocab size and model architecture: vocabulary size impacts model training and computational complexity, and larger vocab sizes can lead to undertrained rare tokens and over-compression of information.
@Gaurav-pq2ug (01:47:02 - 01:48:54): 🛠 Extending the vocab size of pre-trained models: pre-trained models can have their vocab extended by adding new tokens; the process involves resizing the embeddings and adjusting the final linear layer to produce probabilities for the new tokens.

01:48:11 - 01:49:58  training new tokens, example of prompt compression
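
A hedged sketch of that extension step (freezing the base model and training only the newly added rows is one common option, as in gist-token-style prompt compression; names and sizes are illustrative):

    import torch
    import torch.nn as nn

    old_vocab, n_embd, n_new = 50257, 768, 4            # 4 hypothetical new tokens
    pretrained = nn.Embedding(old_vocab, n_embd)         # stands in for the pretrained table

    extended = nn.Embedding(old_vocab + n_new, n_embd)
    with torch.no_grad():
        extended.weight[:old_vocab] = pretrained.weight  # copy the pretrained rows
    # the new rows [old_vocab:] stay randomly initialized and get trained;
    # the lm_head would be extended the same way, and the rest of the model
    # can be frozen so only the new token embeddings are optimized.
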
@Ahwu_AIClass (01:48:11 - 01:51:56): 🔄 Extending the vocabulary size and applying it to multimodal data: the vocab size can be extended with fairly simple model modifications, including freezing the base model and training only the new parameters. For multimodal data, inputs from other domains can be converted into tokens and processed by the same Transformer. Both academia and industry are exploring how to apply Transformers to multimodal data and have proposed a variety of creative methods and techniques.
@Gaurav-pq2ug (01:48:54 - 01:50:05): 🧠 Fine-tuning techniques: training new tokens with a distillation technique, optimizing over the new tokens without changing the model architecture, and gaining efficiency by training only the token embeddings.

01:49:58 - 01:51:41  multimodal [image, video, audio] tokenization with vector quantization
@Gaurav-pq2ug (01:50:05 - 01:51:42): 🤖 Processing multimodal inputs: adapting Transformers to modalities like images, videos and audio by tokenizing each input domain, then using the same Transformer architecture for the different input types.

01:51:41 - 02:10:20  revisiting and explaining the quirks of LLM tokenization
@Gaurav-pq2ug (01:51:42 - 02:09:21): 📏 Tokenization algorithm analysis: limitations of language models at spelling and simple arithmetic trace back to tokenization, as do the differences between English and non-English tokenization and the impact of tokenization on how well the model handles Python code.
@Ahwu_AIClass (01:51:56 - 01:57:25): 🧠 How tokenization affects specific tasks: long tokens can make the model perform poorly at tasks such as spelling or reversing strings, and tokenization also degrades performance on non-English languages and on simple arithmetic.
@revolutionarydefeatism (01:57:20): In GPT-4, whatever you put inside "<|" and "|>" behaves the same, e.g. "<|a|>".
@Ahwu_AIClass (01:57:25 - 01:59:00): 🛑 Anomalous model behavior on special strings: the model may behave unexpectedly when special strings appear in the input, for example halting its output or producing meaningless results. Handling of these special strings can be buggy, which may leave the model open to attack.
@LachlanJG (01:58:21): My guess is that special tokens are just directly cut from the user-provided string.
@Ahwu_AIClass (01:59:00 - 02:04:59): ⚠️ The effect of trailing whitespace on model output: when the input ends in trailing whitespace, model performance can suffer, with unstable or inaccurate completions. The trailing whitespace pushes the model away from the data distribution it saw in training, which hurts the consistency of its results.
@shivakumarmahesh8096 (02:03:08): "Feel the agi" 🙅 "Feel the jank" 👌
@Ahwu_AIClass (02:04:59 - 02:09:21): 💥 Anomalies caused by a mismatch between the tokenizer's dataset and the model's training dataset: when special strings that became tokens during tokenizer training never appear in the model's training data, the model can behave abnormally on them. Such untrained tokens can trigger effectively undefined behavior at inference time, producing strange outputs or behavior.
@Ahwu_AIClass (02:09:21 - 02:09:33): 🌐 How different formats and languages affect the GPT tokenizer: data format and language both influence the tokenizer's efficiency; for example, JSON is comparatively unfriendly to the GPT tokenizer, which degrades efficiency.
@Gaurav-pq2ug (02:09:21 - 02:10:16): 🧮 Tokenization efficiency considerations: different data formats and representations can impact the efficiency of tokenization.
@Ahwu_AIClass (02:09:33 - 02:10:30): 💰 How data format affects tokenization efficiency: YAML tokenizes more efficiently than JSON, producing fewer tokens. When counting token costs and handling structured data, choosing a more efficient encoding format saves cost and improves efficiency.
@Gaurav-pq2ug (02:10:16 - 02:10:57): 🔑 Importance of measuring token efficiency: tokenization density is crucial for cost-effective processing of data, so it is worth spending time measuring token efficiency across formats.
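
Measuring this is a one-liner with tiktoken (a sketch; the toy JSON/YAML documents below are made up, and exact counts depend on the text):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    json_doc = '{"name": "Ann", "age": 31, "city": "Oslo"}'
    yaml_doc = "name: Ann\nage: 31\ncity: Oslo\n"

    print("json tokens:", len(enc.encode(json_doc)))
    print("yaml tokens:", len(enc.encode(yaml_doc)))   # typically fewer for the same data
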

02:10:20 - 02:12:50  final recommendations
@Ahwu_AIClass (02:10:30 - 02:11:11): 🚧 Taking the importance and challenges of tokenization seriously: the tokenization stage can harbor security and AI-safety issues and deserves attention. Annoying as it is, its importance should not be ignored, and it awaits further research and improvement.
@Gaurav-pq2ug (02:10:57 - 02:13:35): 🛠 Recommendations for applying tokenization: reuse the GPT-4 tokens and vocabulary where you can, and consider using a library like tiktoken for inference.
@Ahwu_AIClass (02:11:11 - 02:13:35): 🛠️ Application advice and recommended tools: for applications, if you can reuse the GPT-4 tokens and vocabulary, tiktoken is an efficient library for inference; for training your own vocabulary, prefer a byte-level BPE approach like the one tiktoken and OpenAI use.

02:12:50 - 02:13:35  ??? :)
@luficerg2007 (02:13:00): It's real fun seeing him making mistakes and re-recording them all. I enjoyed this a lot. Thanks Andrej Sir...

Andrej Karpathy