Tokenisation
The process of splitting text into tokens using a vocabulary and an encoding algorithm such as BPE.
Full Definition
Tokenisation converts raw text into a sequence of integer IDs that the model can process. Byte Pair Encoding (BPE), used by GPT models, starts with individual bytes and iteratively merges the most frequent adjacent pairs to build a vocabulary of ~50k–128k subword tokens. WordPiece (BERT) and SentencePiece (T5, Llama) are alternative algorithms. Tokenisation is language- and model-specific: the same text tokenises differently depending on the model, affecting cost, context length, and edge cases. Poorly tokenised inputs (unusual Unicode, code with rare symbols, numbers) can lead to worse model performance, because the model sees token sequences it encountered only rarely during training.
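The core BPE training loop described above can be sketched in a few lines. This is a minimal toy version operating on characters rather than bytes, with no pre-tokenisation or special tokens; the function name and training string are illustrative, not from any real tokeniser.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent token pair."""
    tokens = list(text)  # start from single characters (real BPE uses bytes)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges would memorise the text
        merges.append((a, b))
        # Rewrite the token sequence with the pair fused into one token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", 4)
# The first learned merges fuse the shared prefix: ('l','o'), then ('lo','w').
```

At inference time a real tokeniser does not re-count pairs; it replays the learned merge list in order, which is why the same vocabulary always segments the same text identically.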
Examples
The number '1234567890' tokenising to ['12', '345', '678', '90'] — 4 tokens — making arithmetic harder for models than it would be for humans who perceive it as one number.
Chinese text tokenising into more tokens per character than English, increasing API cost for Chinese-language applications.
Apply this in your prompts
PromptITIN automatically uses techniques like Tokenisation to build better prompts for you.