Tokens
Introduction
What is the relationship between tokens and prompts? Prompts are the user inputs / text that will be sent to the model. Tokens are the chunks that this text is split into, roughly 4~5 characters == 1 token (varies depending on the implementation).
Why Is It Important?
LLMs are not humans that interpret words; however, they are excellent at interpreting numbers and associating objects with values. It would be inefficient to assign one long token value to a word that could be broken up into smaller pieces.
Smaller numerical values (IDs) make it easier to perform calculations in vectorised formats. This results in faster processing when predicting the next chunk (as a numerical value). Prediction accuracy depends on the algorithms that handle context between these numerical values. Details are discussed further in Transformers for Deep Learning.
A vocabulary (think of it as a database) is created from a fixed number of tokens and is used for contextual and semantic linking of tokens. Transformers use this vocabulary to perform matrix calculations. As such, bigger vocabulary == more computational power needed, so limiting the vocabulary size is necessary. At the same time, limiting the vocabulary size means any token not within the vocabulary may result in loss of information during contextual linking.
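As a minimal sketch (the words and IDs below are made up purely for illustration), a vocabulary can be thought of as a lookup table from tokens to integer IDs, with anything outside the vocabulary falling back to a placeholder ID:

```python
# Minimal sketch: a tiny "vocabulary" mapping tokens to integer IDs.
# The words and IDs here are made up purely for illustration.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def encode(words):
    # Unknown words fall back to the <unk> ID; this is where
    # information can be lost if the vocabulary is too small.
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(encode(["the", "cat", "sat", "on", "the", "mat"]))  # [1, 2, 3, 4, 1, 5]
print(encode(["the", "dog", "sat"]))                      # [1, 0, 3] ("dog" collapses to <unk>)
```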
Types of Token
Word-based
Splits inputs based on delimiter(s) or RegEx (punctuation-based and rule-based splitting are subsets). Different library implementations such as NLTK, spaCy, Keras, and Gensim are available in Python.
Punctuation can make model training difficult, as coffee?, coffee. and coffee end up as three different tokens. Variations of the same word multiply rapidly, increasing processing time for model training and testing.
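A rough sketch of the idea, using only Python's built-in re module (NLTK and the other libraries above provide ready-made tokenizers): splitting on whitespace alone keeps punctuation attached to the word, while a regex that separates punctuation avoids the coffee? / coffee. / coffee blow-up:

```python
import re

text = "Want coffee? I love coffee. Coffee is great"

# Naive whitespace split keeps punctuation attached to the word,
# so "coffee?", "coffee." and "coffee" become three different tokens.
print(text.lower().split())
# ['want', 'coffee?', 'i', 'love', 'coffee.', 'coffee', 'is', 'great']

# A regex that splits punctuation off as its own token avoids the blow-up.
print(re.findall(r"\w+|[^\w\s]", text.lower()))
# ['want', 'coffee', '?', 'i', 'love', 'coffee', '.', 'coffee', 'is', 'great']
```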
Character-based
Splits words into individual characters, giving a much smaller vocabulary than word-based tokenization: English has roughly 256 distinct characters versus ~170,000 words.
Advantages:
Very few to no unknown words outside the vocabulary
Misspelled words can still be represented with known characters, so information is not lost to unknown tokens
Disadvantages:
A single character does not carry semantic meaning
Each word is represented by more tokens than with other types
knowledge == 9 tokens
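A quick sketch of character-based tokenization, using the knowledge example above:

```python
word = "knowledge"

# Character-based tokenization: every character is its own token.
char_tokens = list(word)
print(char_tokens)       # ['k', 'n', 'o', 'w', 'l', 'e', 'd', 'g', 'e']
print(len(char_tokens))  # 9 tokens for a single word
```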
Subword-Based
Aims to address the issues of both word-based and character-based tokenization
Principles
Do not split frequently used words into smaller subwords
Split rare words into smaller, meaningful subwords
"token" becomes root word while "ization" is additional information to append to the root word. This improves model efficiency by giving different words meaning to add onto the root word if inputed. Further improvements can be done:
Special symbols can be used to mark whether a piece is a suffix or prefix of the root word, keeping the vocabulary accurate and avoiding loss or misinterpretation of information. The model can then process unknown words by decomposing the input into known subwords.
Types
Byte-Pair Encoding (BPE), used by GPT models
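The sketch below is a heavily simplified illustration of the core BPE idea, repeatedly merging the most frequent adjacent pair of symbols. Real GPT tokenizers learn merges over large corpora and operate on bytes, so the toy corpus and the number of merges here are assumptions for illustration only:

```python
from collections import Counter

# Toy corpus: start every word as a list of characters, then repeatedly
# merge the most frequent adjacent pair of symbols into a single unit.
corpus = ["token", "tokens", "tokenization", "tokenizer"]

def most_frequent_pair(words):
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

words = [list(w) for w in corpus]  # start from individual characters
for _ in range(5):                 # learn 5 merges
    words = merge(words, most_frequent_pair(words))
print(words)  # frequent sequences such as "token" end up as single subword units
```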
Token Limiting
As mentioned previously, the number of tokens varies depending on the words. But does that mean you can have unlimited tokens for each input? No, too many tokens hog the model's processing, as even more computational resources are needed to compute all of them. A token limit per input (along with other factors) is imposed so the model can run efficiently and reliably over long periods.
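As an illustration of checking an input against a token limit, the sketch below assumes the tiktoken library (not mentioned in these notes) and a hypothetical 4096-token limit:

```python
import tiktoken  # assumption: tiktoken is installed; it is not mentioned in these notes

MAX_TOKENS = 4096  # hypothetical per-request token limit
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain how tokenization affects model performance."
token_ids = enc.encode(prompt)

print(len(token_ids))  # number of tokens this prompt would consume
if len(token_ids) > MAX_TOKENS:
    print("Prompt exceeds the limit; truncate, chunk, or summarise it first.")
```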
Truncation
Clip tokens from either end, but the model will suffer loss of information due to the missing context.
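A minimal sketch of truncation over a list of token IDs (the IDs are stand-ins for illustration):

```python
def truncate(token_ids, max_tokens, keep="start"):
    # Keep either the first or the last `max_tokens` tokens;
    # whatever is clipped off is context the model never sees.
    if keep == "start":
        return token_ids[:max_tokens]
    return token_ids[-max_tokens:]

tokens = list(range(10))  # stand-in for real token IDs
print(truncate(tokens, 6))               # [0, 1, 2, 3, 4, 5]
print(truncate(tokens, 6, keep="end"))   # [4, 5, 6, 7, 8, 9]
```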
Chunk Processing
Break the text into smaller chunks, process each chunk, and concatenate the results into a single output. The final result may be prone to errors because information can be lost if the chunking cuts words or context apart incorrectly.
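A minimal sketch of chunking a token sequence, with an optional overlap between chunks to soften the loss of context at each cut (the overlap is an assumption added for illustration, not something stated above):

```python
def chunk(token_ids, chunk_size, overlap=0):
    # Slide a window of `chunk_size` tokens; `overlap` repeats a few
    # tokens between chunks to soften the loss of context at each cut.
    step = chunk_size - overlap
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), step)]

tokens = list(range(12))
print(chunk(tokens, 5))             # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
print(chunk(tokens, 5, overlap=2))  # [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10], [9, 10, 11]]
```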
Summarize
Parts of a long text may not add value to the overall meaning. Summarise the input to maximise the value of each token.
Remove Redundant Terms
The stop-word removal technique from NLP removes common words such as "to" and "the".
This is not reliable for complex sentences because context can be lost, so manually verify the result or use another algorithm to better retain the meaning before applying this technique.
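A short sketch of stop-word removal, assuming the NLTK package and its English stop-word list are available (NLTK is mentioned above for word tokenization, but this particular usage is an assumption):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stop-word list

stop_words = set(stopwords.words("english"))
words = "the report needs to be sent to the client".split()

filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['report', 'needs', 'sent', 'client'] (meaning mostly kept here, but not always)
```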
Fine-tuning Language Models
Continue training a model's existing weights on task-specific data to improve how the model processes that kind of input.
Interview Questions
Could you explain some of the tokenization methods used?
What are some of the ways to handle tokens under usage limitations?
Given a long complex paragraph, explain your choice of algorithms/strategies.