
Tokenization

LLMs don't read text. They read numbers. Before a model can process a single word, that word has to be converted into a sequence of integers — and the step that does this is tokenization.

Tokenization turns raw text into a long sequence of discrete numbers the model can process.

Tokenization follows a specific algorithm — a fixed set of rules that splits text into units called tokens and maps each one to a number. It's the bridge between human-readable text and the numerical input a neural network expects.

What is a token

A token is the fundamental unit of text an LLM processes. You send tokens to the model, you're billed per token, and the model reads, predicts, and generates in tokens — not words.

A token is not the same as a word. When you send Hello world! followed by a newline, the model doesn't see two words and a punctuation mark. With a GPT-style tokenizer it sees four distinct tokens:

Plain Text
"Hello world!\n"  →  ["Hello", " world", "!", "\n"]

Note that the leading space is part of " world" and the trailing newline is a token of its own: tokenizers preserve whitespace because it carries meaning.
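
To make the whitespace point concrete, here is a minimal sketch of a splitter that never throws a character away. The splitting rule is a hypothetical one chosen for illustration; real tokenizers learn their pieces from data, so treat the function name and pattern as assumptions rather than any library's behavior.

```python
import re

# Hypothetical splitting rule, for illustration only: a word keeps the single
# space in front of it, punctuation stands alone, and any other whitespace
# character (such as a newline) becomes its own piece.
PIECE_PATTERN = re.compile(r" ?\w+|[^\w\s]|\s")


def split_keeping_whitespace(text: str) -> list[str]:
    """Split text into pieces without discarding any character."""
    return PIECE_PATTERN.findall(text)


pieces = split_keeping_whitespace("Hello world!\n")
print(pieces)                               # ['Hello', ' world', '!', '\n']
print("".join(pieces) == "Hello world!\n")  # True
```

Because every character lands in exactly one piece, joining the pieces gives back the original text, which is the property that lets a model's output be decoded into readable text again.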

Depending on the tokenizer, a single token can represent:

  • a single character
  • a subword (part of a word)
  • a complete word
  • a punctuation mark
  • a whitespace or special character

A useful mental model is LEGO. The model never sees whole sentences as words — it sees a set of reusable bricks it can snap together:

Plain Text
"I love machine learning!"

[I] [ love] [ machine] [ learn] [ing] [!]

The word "learning" gets split into learn + ing, two reusable bricks that show up again in other words: learn in "learned" and "learner", ing in "burning".

Token ≠ word

Because tokens are subword pieces, the token count of a text rarely matches its word count. This is by design: LLMs have limited vocabularies, typically 30,000 to 100,000 tokens, with some recent tokenizers going larger (roughly 200,000). A fixed vocabulary of that size cannot contain every word in every language, so tokenizers break rare or complex words into smaller, reusable pieces:

Plain Text
"extraordinary"  →  "extra" + "ordinary"

By composing subwords, a limited vocabulary can express an unlimited space of words. The most common algorithm for learning these pieces is Byte-Pair Encoding (BPE), which starts from individual characters and repeatedly merges the most frequent adjacent pair until the vocabulary reaches its target size.
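
The merge loop described above can be sketched in a few lines. This is a simplified, character-level version written for illustration, not the code of any production tokenizer: the function name is hypothetical, it stops after a fixed number of merges as a stand-in for "until the vocabulary reaches its target size", and ties are broken by whichever pair was seen first.

```python
from collections import Counter


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words."""
    # Every word starts as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["low", "low", "lower", "lowest"], 2))
# [('l', 'o'), ('lo', 'w')]
```

Each learned merge adds one new symbol to the vocabulary, so frequent character sequences such as "low" quickly become single tokens while rare ones stay split.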

For a step-by-step walkthrough of how BPE builds a vocabulary — and the three families of tokenizer it belongs to — see How Tokenizers Are Built.

Why tokenization matters

Tokenization isn't just plumbing — its quality shapes how well and how cheaply a model works.

Reason | Why it matters
Vocabulary management | A 30K–100K vocabulary must cover unlimited language. Subword splitting makes that possible.
Handling unknown words | A word the model never saw, like "biocatalyst", can still be understood as bio + catalyst, two known pieces.
Efficiency | Sequence length drives compute cost. Fewer tokens per text means cheaper, faster processing.
Model performance | Poor tokenization hurts comprehension and generation, especially for non-English text and specialized domains.
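
As an illustration of the "handling unknown words" row, the sketch below splits an unseen word into known pieces using greedy longest-match lookup. This is a simplified stand-in, not how BPE applies its learned merges, and the vocabulary and function name are hypothetical.

```python
def segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split of word into pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Toy vocabulary, assumed for this example only.
TOY_VOCAB = {"bio", "cat", "catalyst", "extra", "ordinary"}

print(segment("biocatalyst", TOY_VOCAB))    # ['bio', 'catalyst']
print(segment("extraordinary", TOY_VOCAB))  # ['extra', 'ordinary']
```

The single-character fallback is what keeps any input representable: a word with no known pieces simply costs one token per character.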

From tokens to numbers

Splitting text into tokens is only half the job. Each token still has to become a number the network can do math on.

Every token in the vocabulary is assigned a unique integer — its token ID:

Plain Text
"Hello"   →  token ID 15496
" world"  →  token ID 995

Those IDs are then converted into embeddings: dense vectors of real numbers, typically 512, 1024, or more dimensions. The embedding for "Hello" might look like [0.23, -0.45, 0.78, ...]. Embeddings are what let the model represent meaning — similar tokens end up with similar vectors.

The full pipeline from text to model input:

Plain Text
Text:        "Hello world"
  ↓ Tokenization
Tokens:      ["Hello", " world"]
  ↓ Token IDs
IDs:         [15496, 995]
  ↓ Embedding layer
Vectors:     [[0.23, -0.45, 0.78, ...], [0.12, 0.89, -0.34, ...]]
  ↓ Transformer
Neural network processes the numerical vectors
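
The same pipeline can be sketched as two dictionary lookups. The two token IDs come from the example above; everything else (the four-dimensional vectors, the random values, the function names) is an assumption made for illustration, since a real embedding table is learned during training.

```python
import random

# IDs for these two tokens are taken from the example above; the table is
# otherwise a toy, not a real vocabulary.
TOKEN_TO_ID = {"Hello": 15496, " world": 995}
EMBEDDING_DIM = 4  # real models use hundreds or thousands of dimensions

# Stand-in for a learned embedding matrix: one random vector per token ID.
rng = random.Random(0)
EMBEDDINGS = {
    token_id: [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for token_id in TOKEN_TO_ID.values()
}


def encode(tokens: list[str]) -> list[int]:
    """Map each token to its integer ID."""
    return [TOKEN_TO_ID[token] for token in tokens]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one vector per token ID."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


ids = encode(["Hello", " world"])
vectors = embed(ids)
print(ids)                            # [15496, 995]
print(len(vectors), len(vectors[0]))  # 2 4
```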

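Most model providers train their own tokenizer, so the same text can produce different token counts on different models. A tokenizer's efficiency directly affects how much content fits in the model's context window and can affect the quality of its output.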

Context window

Because the model works in tokens, its limits are measured in tokens too. The context window is the maximum number of tokens a model can process at once. It affects:

  • Input length — how much text the model can consider before responding
  • Output length — how much it can generate in a single completion
  • Coherence — how well it stays consistent across long conversations or documents

More efficient tokenization means more real content fits inside the same context window.
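
A small budgeting sketch shows how these limits interact. It assumes that prompt and generated tokens share one window, which is common but worth checking for the model you use; the function names and the keep-the-most-recent-tokens policy are choices made for this example.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def truncate_to_budget(token_ids: list[int], max_output_tokens: int, context_window: int) -> list[int]:
    """Drop the oldest tokens so the rest leaves room for the output."""
    budget = max(context_window - max_output_tokens, 0)
    # Guard the zero case: token_ids[-0:] would return the whole list.
    return token_ids[-budget:] if budget else []


print(fits_in_context(5, 3, 8))                   # True
print(fits_in_context(6, 3, 8))                   # False
print(truncate_to_budget(list(range(10)), 3, 8))  # [5, 6, 7, 8, 9]
```

Keeping the most recent tokens suits chat-style input; a document task might instead keep the beginning or summarize the middle.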

Where tokenization breaks down

Tokenization causes several counter-intuitive behaviors in practice.

Non-English languages

Tokenizers are trained mostly on English text, simply because the internet is dominated by English. So they learned to tokenize English efficiently — and everything else less so.

Imagine a 50,000-word dictionary built that way:

  • ~40,000 English words
  • ~5,000 words for one other language
  • ~5,000 for every remaining language combined

The effect shows up directly in token counts. An English word the tokenizer knows well splits into few pieces, while an equivalent word in another language gets shattered into many:

Plain Text
"unhappiness"  →  ["un", "happiness"]        = 2 tokens
"nesreća"      →  ["n", "es", "re", "ć", "a"] = 5 tokens

Same meaning, more than double the tokens — which means non-English users pay more and fit less into the context window. (Exact splits depend on the tokenizer; these are illustrative.)

How much more, exactly

The gap isn't trivial — peer-reviewed studies have measured it across dozens of languages. The ratios cluster by writing system:

Language family | Tokens vs. the same meaning in English | Practical cost multiplier
English / Latin (baseline) | 1× | 1×
Cyrillic (Serbian, Russian, Ukrainian, …) | 2–3× | ~2–3×
Chinese / Japanese / Korean (CJK) | ~1 token per character vs. ~4 characters per token in English | ~4–5×
Other non-Latin / morphologically complex (Hindi, Arabic, Hebrew, …) | 3–5× | 3–5×

Two technical reasons compound:

  • BPE was trained on mostly English text. Byte-Pair Encoding learns merges from frequency in the training corpus. Non-Latin characters didn't appear often enough during BPE training to form many merges, so they stay split into small pieces.
  • UTF-8 encoding penalty. ASCII letters are 1 byte each, but Cyrillic characters take 2 bytes and most CJK characters take 3. Even before tokenization there's a byte-level overhead the merges can't fully recover.
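
The UTF-8 overhead is easy to check with the standard library alone. The sample greetings and the function name are picked for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in text."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello",
    "Serbian (Cyrillic)": "здраво",
    "Chinese": "你好",
}

for language, greeting in samples.items():
    print(language, utf8_bytes_per_char(greeting))
# English 1.0
# Serbian (Cyrillic) 2.0
# Chinese 3.0
```

A byte-level tokenizer starts from these bytes, so a script that needs more bytes per character also needs more merges before it reaches one token per word.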

The implication is direct: the smart zone of your context window shrinks in proportion to your tokenization cost. At the multipliers in the table above, a 200K-token window with ~150K tokens of usable context holds only about as much Cyrillic text as 50K–75K English tokens would, and about as much Chinese text as 30K–40K English tokens would. This is also why writing docs, commit messages, and code comments in English isn't just a style choice: for an agent reading the repo, it can be a 2–5× efficiency gain.
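
The arithmetic behind this is a single division. The sketch below takes the ~150K usable-token figure from the text and applies cost multipliers in the range given in the table; the function name is hypothetical.

```python
def english_equivalent_capacity(usable_tokens: int, cost_multiplier: float) -> int:
    """English tokens that would carry the same content as usable_tokens
    of text in a language costing cost_multiplier times as many tokens."""
    if cost_multiplier <= 0:
        raise ValueError("cost_multiplier must be positive")
    return round(usable_tokens / cost_multiplier)


for multiplier in (1, 2, 3, 4, 5):
    print(multiplier, english_equivalent_capacity(150_000, multiplier))
# 1 150000
# 2 75000
# 3 50000
# 4 37500
# 5 30000
```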

The arXiv paper Language Model Tokenizers Introduce Unfairness Between Languages frames this as a structural inequity, not an implementation bug. Speakers of underrepresented languages pay more for the same task — both in dollars and in effective context.

Special characters

Emoji and unusual formatting can consume far more tokens than expected. A single emoji like 🧠 may take several tokens depending on the tokenizer, quietly inflating token usage.
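
One way to see why: in a byte-level BPE tokenizer every byte is a valid token, so a character's UTF-8 byte count is the worst-case number of tokens it can cost. The sketch below only counts code points and bytes; actual token counts depend on the tokenizer's learned merges.

```python
def describe(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


brain = "\U0001F9E0"  # the brain emoji
# A family emoji is three emoji glued together with zero-width joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(describe("A"))     # (1, 1)
print(describe(brain))   # (1, 4)
print(describe(family))  # (5, 18)
```

What renders as one symbol on screen can be five code points and eighteen bytes underneath.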

Numbers and code

Some tokenizers fragment numbers and programming constructs in odd ways, splitting them across multiple tokens. This is often cited as one reason LLMs struggle with arithmetic and exact code generation: they see fragmented symbols, not clean numbers.
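
To illustrate the fragmentation, assume a hypothetical rule that chunks a run of digits left to right into groups of up to three. Actual splits vary by tokenizer, so this is a sketch of the effect rather than any specific model's behavior.

```python
import re


def chunk_digits(number_text: str) -> list[str]:
    """Split digits left to right into groups of at most three."""
    return re.findall(r"\d{1,3}", number_text)


print(chunk_digits("1234567"))  # ['123', '456', '7']
print(chunk_digits("234567"))   # ['234', '567']
```

The chunks don't line up with place value (1,234,567 groups from the right), and the same trailing digits are split differently once a leading digit is added, so the model has to learn arithmetic over inconsistent pieces.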


For how tokens fit into the bigger picture of building a model, see How LLMs Are Built.

Further reading

  • Let's build the GPT Tokenizer — Andrej Karpathy builds a GPT tokenizer from scratch; the final section explains the exact quirks (numbers, non-English cost) this page covers.
  • Hugging Face NLP Course — Tokenizers — the canonical free chapter on what tokenizers do and the three subword algorithms.
  • openai/tiktoken — OpenAI's fast BPE tokenizer; count tokens and inspect token IDs for the exact models you call.