Tokenisation
The process of splitting text into tokens using a vocabulary and an encoding algorithm such as BPE.
Full Definition
Tokenisation converts raw text into a sequence of integer IDs that the model can process. Byte Pair Encoding (BPE), used by GPT models, starts with individual bytes and iteratively merges the most frequent adjacent pairs to build a vocabulary of ~50k–128k subword tokens. WordPiece (BERT) and SentencePiece (T5, Llama) are alternative algorithms. Tokenisation is language- and model-specific: the same text tokenises differently depending on the model, affecting cost, context length, and edge cases. Poorly tokenised inputs (unusual Unicode, code with rare symbols, numbers) can lead to worse model performance, because the model sees token sequences it encountered only rarely during training.
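The core BPE training loop described above can be sketched in a few lines. This is a minimal toy version operating on characters rather than bytes, with no pre-tokenisation or special tokens; the function name and training string are illustrative, not from any real tokeniser.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent token pair."""
    tokens = list(text)  # start from single characters (real BPE uses bytes)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges would memorise the text
        merges.append((a, b))
        # Rewrite the token sequence with the pair fused into one token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", 4)
# The first learned merges fuse the shared prefix: ('l','o'), then ('lo','w').
```

At inference time a real tokeniser does not re-count pairs; it replays the learned merge list in order, which is why the same vocabulary always segments the same text identically.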
Examples
The number '1234567890' tokenising to ['12', '345', '678', '90'] — 4 tokens — making arithmetic harder for models than it would be for humans who perceive it as one number.
Chinese text tokenising into more tokens per character than English, increasing API cost for Chinese-language applications.
Apply this in your prompts
PromptITIN automatically uses techniques like Tokenisation to build better prompts for you.