
intro: Tokenization, GPT-2 paper, tokenization-related issues

*What is tokenization?*
- Tokenization is the process of converting text into a sequence of tokens.
- In large language models it is the key step that turns raw text into the token sequences the model actually processes.
- The quality of the tokenization method directly shapes the model's performance and behavior.

- Tokenization process overview
  - Tokenization is crucial for working with large language models.
  - Tokenization converts text into tokens for language model processing.

How does it know how DefaultCellStyle is spelled? Is there something in the training data that helps create a mapping from that token to the version with spaces? Did OpenAI maybe augment the training data with 'spelling tables'?

*The byte pair encoding algorithm used by GPT-2*
- Byte pair encoding (BPE) is a commonly used tokenization method for building the token vocabulary of a large language model.
- The GPT-2 tokenizer uses BPE to build its vocabulary, where a single token can stand for a combination of several characters.
- BPE handles diverse languages and special characters flexibly, which improves the model's generality and performance.

- Byte pair encoding for tokenization
  - Byte pair encoding is used in state-of-the-art language models.
  - Tokenization generates the vocabularies used for language model input.
  - Tokens are the fundamental units large language models operate on.

tokenization by example in a Web UI (tiktokenizer)

*Tokenization problems in language models*
- Tokenization is essential to a language model's performance and behavior, but it also introduces problems and trade-offs.
- Tokenization quality differs across languages; non-English languages in particular suffer from imbalanced training data.
- The design and implementation of the tokenizer have a real impact on model efficiency and quality, so several factors have to be balanced when optimizing it.

Hey Andrej, thanks for the new video! I'm not yet done but I noticed at you mentioned "notice that the colour is different, so this is not the same token". But actually in that app, the colours are random, and are just cycling through so as not to have twice the same colours in a row. See e.g. the " +" token with different colours, or all the differently coloured spaces in the python code.

For these problems mentioned at around (the word "egg" got tokenized in different ways): would it help if we just lower-cased all the text and used an actual dictionary as token vocabulary?

- Multilingual tokenization challenges
  - Non-English languages face different tokenization challenges.
  - Tokenizers have to handle very different sequence lengths across languages.

@ OFFF Course this legend also speaks Korean! Why wouldn't he? Awesome video Andrej! ❤

omg perfect Korean

Wow, his Korean is so accurate and his accent is incredible. I'm Korean, and this brilliant top-notch human (ASI-level, haha) can do better at anything than me, now even at my mother language, haha ;)

- Tokenization's impact on Python coding
  - Tokenization affects how language models handle code.
  - Tokenizer design influences the model's performance on specific languages.

strings in Python, Unicode code points

"Unicode." I despise Unicode with the passion of a million searing fires. I've written enough code to handle Unicode to feel your pain through the screen without you saying a single word about it. ASCII was v1.0 of character handling. Extended ASCII with "Code Pages" was v1.3. Unicode is barely v2.0 and we still haven't gotten it right. So maybe by v3.0, whatever it ends up being called, we'll _finally_ figure out that human language is too complex to represent in computer systems using a set number of bytes for the representation of a character sequence and finally offer something much more flexible and comprehensive that's also compatible/performant with how computer systems work.

- Unicode encodings for text processing
  - Unicode encodings such as UTF-8 are essential for processing text.
  - Different encodings have varying efficiencies and use cases.
  - UTF-8 is preferred for its compatibility and efficiency.

Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32

*Choosing and comparing character encodings*
- UTF-8 is the dominant encoding on the internet, partly because it is the only one of the Unicode encodings that is backward compatible with ASCII.
- UTF-8 is also more space-efficient than the other encodings for typical text.
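A small Python illustration of the difference between Unicode code points (what `ord()` returns) and the bytes each encoding produces, using a mixed Korean/emoji/English string like the one in the video:

```python
s = "안녕하세요 👋 (hello in Korean!)"

print(ord("안"))                      # 50504, the Unicode code point
print(list("안".encode("utf-8")))     # [236, 149, 136], three bytes under UTF-8

# The same string under the three Unicode encodings; UTF-8 is the most compact here.
for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, len(s.encode(enc)))
```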

*Byte pair encoding in a nutshell*
- The BPE algorithm compresses a sequence by iteratively finding the most frequent pair of adjacent tokens and replacing it with a new token.
- Starting from raw bytes, it builds a compact, fixed-size vocabulary of merges that can encode and decode arbitrary sequences.

- Byte Pair Encoding algorithm overview
  - The BPE algorithm compresses sequences by iteratively finding and merging the most frequent pair of tokens.
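For intuition, here is the classic toy example usually used to introduce BPE, sketched in plain Python:

```python
from collections import Counter

seq = "aaabdaaabac"
pair_counts = Counter(zip(seq, seq[1:]))
print(pair_counts.most_common(1))    # ('a', 'a') occurs 4 times, so it is merged first

# Repeatedly replacing the top pair with a fresh symbol (Z="aa", then Y, then X)
# shrinks the 11-character string to the 5-symbol "XdXac"; the small merge table
# is all that is needed to decode back to the original.
```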

daydreaming: deleting tokenization

I'm at , and I'm wishing the tokenization was getting at the etymological roots of words and/or meaning of marks in pictographic languages.

Byte Pair Encoding (BPE) algorithm walkthrough

starting the implementation

*Implementing byte pair encoding*
- Implement BPE in Python: identify the most common byte pair, replace it, and grow the vocabulary, step by step.
- Apply merges iteratively over the byte sequence until the desired vocabulary size is reached.

- Implementing the Byte Pair Encoding algorithm in Python
  - Encode the text as UTF-8 and convert the bytes to integers for easy manipulation.
  - Identify the most common pair of tokens and replace it with a new token, using small Python functions.

Hey Andrej, great video! However, at , you don't need to convert all the bytes to integers by using map(). When you call list() on tokens, the bytes are by default converted into integers, so just doing 'list(tokens)' is fine instead of 'list(map(int, tokens))'.

At you don't need map(int, ...) because bytes are already enumerable, so just use tokens = list(tokens)

counting consecutive pairs, finding most common pair
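A minimal version of the pair-counting step, in the spirit of the code written in the video:

```python
def get_stats(ids):
    """Count how often each consecutive pair of ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):       # iterate over consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

tokens = list("hello world, hello tokenizer!".encode("utf-8"))
stats = get_stats(tokens)
print(max(stats, key=stats.get))         # the most common consecutive byte pair
```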

merging the most common pair
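And a matching merge step that rewrites the sequence, replacing every occurrence of the chosen pair with a newly minted token id:

```python
def merge(ids, pair, idx):
    """Replace all occurrences of `pair` in `ids` with the new token id `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # if the pair starts at position i, emit the new id and skip both elements
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
```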

I'm jumping in with a comment before finishing the video, but one thing I noticed about this byte-pair encoding implementation is that it is agnostic to the UTF-8 character boundaries. So it should be possible that a token only represents the bytes of half of a multi-byte character. In that case, when trying to visualise which characters are part of which token, like in the tiktokenizer tool you showed at the start, it couldn't really be visualised properly since one character could be split across two tokens. I wonder if this is the case in GPT's encoding or whether there's a case to make sure characters are always grouped into the same token. I'll keep watching... :D

GPT-4 uses ~100,000 tokens, which is not far from the ~150,000 characters that Unicode defines.

training the tokenizer: adding the while loop, compression ratio

- Training and usage of the tokenizer
  - Set the vocabulary size and perform a fixed number of merges to train the tokenizer.
  - The tokenizer is a separate preprocessing stage, distinct from the language model itself.
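A minimal training loop, assuming the get_stats() and merge() sketches above; the corpus file name is just a placeholder:

```python
vocab_size = 276                     # desired final vocabulary size
num_merges = vocab_size - 256        # we start from the 256 raw byte values

text = open("corpus.txt", encoding="utf-8").read()
tokens = list(text.encode("utf-8"))

ids = list(tokens)
merges = {}                          # (int, int) -> new token id
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```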

I'm a total noob, but would there be any benefit instead of taking the whole blog post (around ) and making a .txt file and having the program read it like that as opposed to pasting it as one long line? Just curious if there is pros/cons either way or if it truly doesn't matter
At , in merge, why are we incrementing by 2? Suppose my top pair is (6, 6) and the encoded text is [7, 6, 6, 5, 4, 3]; the code will not be able to replace the (6, 6) with the minted token. Am I missing anything?

Shouldn't it be **num_merges = vocab_size - len(set(tokens))** where **len(set(tokens))** is actually 158 instead of 256?

where would you learn how to code like @?

*Tokenizer training summary*
- The tokenizer is trained completely independently of the large language model.
- It has its own training set and is trained with the BPE algorithm to build the vocabulary.
- Training happens once, up front; afterwards the tokenizer is used only for encoding and decoding.

tokenizer/LLM diagram: it is a completely separate stage

*Tokenizer encoding and decoding*
- The tokenizer is a translation layer between raw text and token sequences.
- It can encode raw text into a token sequence and decode a token sequence back into raw text.
- A large language model's training data is typically preprocessed into token sequences; the model never sees raw text directly.

- Tokenizer training considerations
  - The tokenizer's training set should be diverse, covering many languages and data types (prose, code, etc.).
  - The mix of data determines how densely each kind of text is tokenized, which in turn affects model performance.

*Implementing encode and decode*
- Encoding turns text into a token sequence, applying merges in the order they appear in the merges dictionary.
- Decoding maps a token sequence back to raw text by expanding tokens through the same merges.
- Decoding must handle byte sequences that are not valid UTF-8; the usual fix is to pass an error-handling argument (errors="replace") rather than crash.

decoding tokens to strings

- Decoding ids back into text
  - Build each token's bytes by iterating over the ids and looking them up in the vocab.
  - Concatenate the bytes of all tokens.
  - Decode the resulting bytes back to a string using UTF-8.
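A decoding sketch along those lines, assuming the `merges` dict built during training; note that building `vocab` relies on dict insertion order, which is also why the order in which entries are added matters:

```python
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():         # insertion order guarantees that the
    vocab[idx] = vocab[p0] + vocab[p1]       # two parents are defined before the child

def decode(ids):
    """Map a list of token ids back to a Python string."""
    text_bytes = b"".join(vocab[idx] for idx in ids)
    # errors="replace" guards against id sequences that are not valid UTF-8
    return text_bytes.decode("utf-8", errors="replace")
```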

why at would it matter the order you add the new vocab terms? if you add idx=257 for pair a,b before idx=256 for pair c,d the dictionary is permutation equivariant as a hash table?

Ahh, partially addressed at . However this is fixing error when decoding an invalid UTF-8 sequence. Such errors could be minimised by only tokenizing full UTF-8 sequences, so in this example chr(128) wouldn't be its own token as that's only valid as a UTF-8 continuation byte, not as the first byte of a character.

encoding strings to tokens
I have a question regarding the encoding process . Why not preprocess the keys of the merges dictionary into byte sequences (in the [0–255] range), and then just do a longest prefix match on the input? We may then benefit from a trie-like data structure.

- 𧬠Implementing encoding of string into tokens- Encoding text into UTF-8 to get raw bytes- Performing merges according to lookup dictionary- Identifying pairs for merging and performing merges

I guess next step is to build a vocabulary similar to `decode` and use a trie to encode straight to final tokens?

At , can we not just implement encode by iterating over the merges dictionary (the order is maintained) and calling the merge() function on the tokens? This is what I mean:

    def encode(text):
        tokens = list(text.encode("utf-8"))
        for pair, idx in merges.items():
            tokens = merge(tokens, pair, idx)
        return tokens

I am hugely confused at . Why are we writing such a complicated encoder using a while loop and unintuitive stuff like pair = min(stats, key=lambda p: merges.get(p, float("inf")))? Why can't I just do:

    def encode(self, text):
        tokens = text.encode("utf-8")
        tokens = list(map(int, tokens))
        for pair, index in self.merges.items():
            tokens = merge(tokens, pair, index)

- Perfecting the encoding function and testing
  - Handle the special case of a single character or an empty string.
  - Test that encode and decode round-trip consistently.
  - Validate the implementation on training and validation data.

I think this question is addressed at .

*The tokenizer in the GPT-2 paper*
- The GPT-2 paper describes its tokenizer, which is based on byte pair encoding (BPE).
- It notes that naively letting BPE merge everything produces many redundant, semantically awkward tokens (e.g. "dog.", "dog!", "dog?"), so the authors add manual rules that forbid certain merges.

regex patterns to force splits across categories

*GPT-2 tokenizer implementation details*
- The GPT-2 tokenizer includes a fairly involved regular-expression pattern that dictates which parts of the text must never be merged together.
- It uses Python's `regex` package for the more powerful pattern matching (Unicode categories such as \p{L}).

- Tokenization rules and inconsistencies
  - The apostrophe alternatives ('s, 't, 're, ...) only match lowercase, so uppercase text is handled inconsistently.
  - Punctuation is matched separately so that it never merges with letters or numbers.
  - Whitespace handling matters too, including a negative look-ahead assertion that keeps the last space attached to the following word.
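For reference, this is the split pattern from OpenAI's released GPT-2 encoder.py; it needs the third-party `regex` package for the \p{...} Unicode categories, and BPE merges are then applied within each chunk, never across chunk boundaries:

```python
import regex as re

gpt2pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# splits into chunks of letters, numbers, punctuation and whitespace; note the
# lowercase-only apostrophe alternatives at the front (the inconsistency above)
print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))
```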

"extremely gnarly, and slightly gross" (), is how I feel about ML 99% of the time

*The tiktoken library*
- OpenAI released the tiktoken library, which provides the tokenizer used for GPT-4.
- Unlike GPT-2, the GPT-4 tokenizer merges runs of spaces into a single token.

- GPT-4 tokenizer and the GPT-3.5-turbo scheme
  - The GPT-4 tokenizer uses different merging rules (and a different split pattern) than GPT-2.
  - The GPT-3.5-turbo scheme introduces new special tokens for conversation tracking.
  - Handling special tokens requires additional model adjustments, such as extending the embedding matrix.

tiktoken library intro, differences between GPT-2/GPT-4 regex

*Changes in the GPT-4 tokenizer*
- The GPT-4 tokenizer modifies the GPT-2 recipe: the regex split pattern changed, as did the handling of whitespace and numbers.
- The new pattern matches the apostrophe contractions case-insensitively and caps number chunks at three digits, to avoid creating very long numeric tokens.
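A quick way to see the GPT-2 vs GPT-4 differences yourself is the tiktoken library (the exact token ids depend on the vocabularies, so they are not hard-coded here):

```python
import tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")
enc_gpt4 = tiktoken.get_encoding("cl100k_base")   # the GPT-4 vocabulary

s = "    hello world 1234567"
print(enc_gpt2.encode(s))   # GPT-2: leading spaces tend to stay as separate tokens
print(enc_gpt4.encode(s))   # GPT-4: runs of whitespace merge; digits chunked at most 3 at a time
```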

I'm guessing they limit the numerical tokens to a length of 3 because otherwise they would blow out the size of the vocabulary trying to store the various combinations of numbers, or am I off base on that?

The reason they only match up to 3 digits is quite simple: 1000000 is normally written as 1,000,000, so at most 3 digits per segment are necessary. Applying the pattern segments the number string into "1" - "," - "000" - "," - "000".

GPT-2 encoder.py released by OpenAI walkthrough

Our variable naming was really good ()

*Tokenizer algorithm principles*
- The algorithm developed here is essentially the same as OpenAI's implementation.
- Once you understand the principle, you can build, train, and use a tokenizer yourself.
- OpenAI's implementation adds some fairly unimportant details, but the core algorithm is the same.

I think the reason for the byte encode/decode is to make sure no control codes are stored in the file, since it's being read as text. E.g. 0xA and 0xD are newline characters and those could mess up the file. That said, I haven't looked at the BPE file, just the merges file for CLIP, so it can be different for Open AI.

special tokens, tiktoken handling of, GPT-2/GPT-4 differences

*Uses and handling of special tokens*
- Special tokens mark special structure in the data or delimit separate parts of it.
- Adding special tokens requires some surgery on the model, including extending the embedding matrix and the final projection layer.
- This is especially common in fine-tuning, for example when turning a base language model into a chat model.

- Special tokens and fine-tuning
  - Special tokens, like "End of Text" (<|endoftext|>), delimit documents in the GPT training set.
  - Adding special tokens requires model adjustments such as extending the embedding matrix.
  - Special tokens are crucial for tasks like fine-tuning a base model into a chatbot model.
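In tiktoken, special tokens are never produced by the ordinary BPE path; you have to allow them explicitly, which also protects against user text injecting them. A small sketch (100257 is the <|endoftext|> id in cl100k_base):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [100257]
# enc.encode("<|endoftext|>")   # without allowed_special this raises an error, by design
```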

oh my, the realization of the year 🔥🔥🔥🔥

what is it short for at ?

minbpe exercise time! write your own GPT-4 tokenizer

Q: What is Andrej's favorite programming language? A: Swift

The moment when you realise there is more to life than research.

- Tokenization using SentencePiece
  - SentencePiece is widely used in language models because it supports both training and inference efficiently.

sentencepiece library intro, used to train Llama 2 vocabulary

*SentencePiece compared with tiktoken*
- SentencePiece is another commonly used tokenization library, and it supports both training and inference.
- It tokenizes differently: BPE runs directly on Unicode code points, with a byte fallback for rare code points.
- SentencePiece exposes a large number of configuration options, which usually need careful tuning for a particular task.

- Configuration and training with SentencePiece
  - SentencePiece has numerous configuration options, many of them historical baggage.
  - Training involves defining input/output files, choosing the algorithm, and setting preprocessing/normalization rules.

*How SentencePiece works and its parameter settings*
- SentencePiece applies a pipeline of normalization and splitting rules before encoding; its built-in notion of "sentences" sits awkwardly next to the LLM view of training data as one long stream of text.
- Training requires specifying the special tokens (UNK, BOS, EOS, PAD), and the UNK token must exist.
- An example walks through the resulting vocabulary and the encoding process, including how unknown characters are handled via byte fallback.
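A small training run with the sentencepiece Python bindings, roughly in the byte-fallback BPE style discussed here; the corpus file name and vocabulary size are placeholders:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",            # plain-text training file, one sentence per line
    model_prefix="tok400",             # writes tok400.model and tok400.vocab
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,                # characters outside the vocab fall back to raw bytes
    character_coverage=0.99995,
    unk_id=0, bos_id=1, eos_id=2, pad_id=-1,   # UNK must exist; PAD disabled here
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))   # rare characters show up as <0x..> byte pieces
```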

how to set vocabulary set? revisiting gpt.py transformer

*Vocabulary size in the Transformer*
- The vocabulary size determines the size of the token-embedding table and the number of parameters in the LM head.
- Increasing it adds compute, spreads the training signal more thinly across rare tokens, and squeezes more text into fewer tokens per sequence.
- Choosing the vocabulary size is an empirical hyperparameter decision, usually in the tens of thousands up to around 100k, depending on the application and the available compute.

- Vocab size and model architecture
  - Vocabulary size impacts model training and computational complexity.
  - Very large vocab sizes can lead to undertrained rare tokens and too much information squeezed into single tokens.

- Extending vocab size in pre-trained models
  - A pre-trained model's vocabulary can be extended by adding new tokens.
  - This involves resizing the embedding table and the final linear layer so the new tokens get embeddings and output probabilities.
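A PyTorch-flavoured sketch of that surgery; the sizes are illustrative, and in practice the old weights would come from the actual pretrained checkpoint:

```python
import torch
import torch.nn as nn

d_model, old_vocab, new_vocab = 768, 50257, 50260   # e.g. adding 3 special tokens

# stand-ins for the pretrained weights
old_emb  = nn.Embedding(old_vocab, d_model)
old_head = nn.Linear(d_model, old_vocab, bias=False)

# larger tables: copy the trained rows, leave the new rows freshly initialised
new_emb  = nn.Embedding(new_vocab, d_model)
new_head = nn.Linear(d_model, new_vocab, bias=False)
with torch.no_grad():
    new_emb.weight[:old_vocab]  = old_emb.weight
    new_head.weight[:old_vocab] = old_head.weight

# To train only the new rows, a common trick is to zero the gradients of the old
# rows after each backward pass (or keep the new rows in a separate parameter),
# so the pretrained weights stay frozen.
```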

training new tokens, example of prompt compression

*Extending the vocabulary and applying Transformers to multimodal data*
- The vocabulary can be extended with fairly simple model surgery; one approach freezes the base model and trains only the new parameters.
- For multimodal data, inputs from other domains can be converted into tokens and processed by the same Transformer.
- Both academia and industry are exploring ways to tokenize other modalities, with a variety of novel methods and techniques.

- Fine-tuning techniques around new tokens
  - Training new tokens with a distillation technique (e.g. compressing long prompts into a few "gist" tokens).
  - Optimizing over the new tokens without changing the model architecture.
  - Fine-tuning stays efficient because only the new token embeddings are trained.
multimodal [image, video, audio] tokenization with vector quantization

- Processing multimodal inputs
  - Transformers can be adapted to process other modalities such as images, video, and audio.
  - Each input domain is tokenized into discrete tokens.
  - The same Transformer architecture then works across the different input types.

revisiting and explaining the quirks of LLM tokenization

- Tokenization quirks revisited
  - Language models struggle with spelling and simple arithmetic largely because of tokenization.
  - English and non-English text tokenize very differently.
  - Tokenization also shapes how well models handle Python code.

*How tokenization affects specific tasks*
- Long tokens can make the model perform poorly on character-level tasks such as spelling checks or reversing a string.
- Non-English text and simple arithmetic also suffer from how they are tokenized, degrading performance.
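You can see the root cause directly with tiktoken: common words collapse into single opaque tokens, so the model never "sees" their characters. The outputs depend on the vocabulary, so they are not hard-coded here:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["egg", " egg", "Egg", " EGG", ".DefaultCellStyle"]:
    ids = enc.encode(s)
    print(repr(s), len(ids), [enc.decode([t]) for t in ids])
```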

in GPT-4 whatever you put inside "<|" and "|>" behaves the same. E.g., "<|a|>"

*Unexpected model behavior around special strings*
- The model can behave unexpectedly when special token strings appear in the input, e.g. halting generation or producing meaningless output.
- Handling of special characters can be a vulnerability and a potential attack surface.

My guess is that special tokens are just directly cut from the user provided string.

*Trailing whitespace and model performance*
- A trailing space in the prompt can hurt the model, making outputs unstable or less accurate.
- The trailing space pushes the input off the distribution the model was trained on (the space is usually part of the next word's token), so results become inconsistent.

"Feel the agi" ð "Feel the jank" ð

*Mismatch between tokenizer training data and model training data*
- When special strings that became tokens in the tokenizer's training data barely appear in the model's training data, the model behaves abnormally when it encounters them.
- Such never-trained tokens are effectively undefined at inference time, producing bizarre outputs or behavior.

*Different formats and languages affect the GPT tokenizer*
- Different data formats and languages can affect the tokenizer's efficiency and, with it, model performance.
- For example, JSON tokenizes comparatively poorly, which makes it more expensive to process.

- Tokenization efficiency considerations
  - Different data formats and representations can substantially change how many tokens the same content costs.

*Data formats and token efficiency*
- YAML is more token-efficient than JSON for the same structured data, producing noticeably fewer tokens.
- When you pay per token and work with structured data, choosing a denser format saves cost and context length.

- The importance of measuring token efficiency
  - Token density is crucial for cost-effective processing of data.
  - It is worth spending time measuring token efficiency across formats before committing to one.
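A rough way to measure this for your own data; this assumes PyYAML is installed, and the exact counts depend entirely on the data:

```python
import json
import yaml          # PyYAML
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
data = {"products": [{"name": f"widget-{i}", "price": 9.99, "in_stock": True}
                     for i in range(20)]}

as_json = json.dumps(data, indent=2)
as_yaml = yaml.dump(data)
print("json tokens:", len(enc.encode(as_json)))
print("yaml tokens:", len(enc.encode(as_yaml)))   # typically the smaller of the two
```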

final recommendations

*Take tokenization seriously*
- The tokenization stage can harbor security issues and AI-safety issues, and it deserves real attention.
- Tokenization is an annoying stage, but its importance should not be ignored; further research and improvements are worth looking forward to.

- Recommendations for applying tokenization
  - If you can, reuse the GPT-4 tokens and vocabulary in your application.
  - For inference, consider a library such as tiktoken.

*Recommended tools*
- For applications: if you can reuse the GPT-4 tokens and vocabulary, tiktoken is an efficient library for inference.
- If you train your own vocabulary, prefer byte-level BPE in the style used by tiktoken and OpenAI.

??? :)
