
How Tokenizers Are Built

Tokenization explained what a token is — the discrete unit an LLM actually reads. This page is about the tokenizer that produces them: how it's built, the three approaches you can take, and why nearly every modern LLM lands on the same one — subword tokenization built with Byte-Pair Encoding (BPE).

Two phases: training and inference

A tokenizer has two distinct moments in its life: the training phase, where its vocabulary is built once, and the inference phase, where that fixed vocabulary is used over and over.

Training phase

During training you feed the tokenizer a huge pile of cleaned text. It splits that text into small units, then collects every unique unit into a vocabulary — a table that maps each token to an integer ID.

The first step is text splitting: break the long stream of text into smaller pieces. The second is building the vocabulary: gather the unique pieces and assign each one an ID. Every unique unit is what we call a token, and the token-to-ID table is the vocabulary.
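The two training steps can be sketched in a few lines. This is only a minimal illustration, assuming a naive regex split into words and punctuation; the function name `build_vocab` and the sample sentence are made up for the example, and real tokenizers split text far more carefully.

```python
import re

def build_vocab(corpus: str) -> dict[str, int]:
    """Split text into units, then map each unique unit to an integer ID."""
    # Step 1: text splitting (assumed rule: runs of word characters, or single punctuation marks).
    units = re.findall(r"\w+|[^\w\s]", corpus)
    # Step 2: build the vocabulary from the unique units, in first-seen order.
    vocab: dict[str, int] = {}
    for unit in units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab

vocab = build_vocab("the cat sat on the mat. the end.")
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, '.': 5, 'end': 6}
```

Note that "the" appears three times in the text but only once in the vocabulary: the table stores unique units, not occurrences.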

Don't confuse training the tokenizer with training the model. Building the vocabulary just produces that lookup table. The model is trained separately — it learns statistical patterns between token IDs by optimizing billions of neural-network parameters.

Inference phase

Once the vocabulary exists, it's frozen. At inference time the tokenizer just applies it — in both directions:

When you send text to an LLM, the tokenizer encodes it into a sequence of token IDs. The model turns those IDs into embedding vectors, runs its computation, and predicts new token IDs one at a time. The tokenizer then decodes the generated IDs back into text.

A tokenizer is reversible: encoding turns text into IDs, decoding turns IDs back into text. The same vocabulary drives both directions. The model itself never works with words — only with the numeric IDs and their embeddings.
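A round trip through a frozen vocabulary might look like the sketch below. The three-entry vocabulary, its IDs, and the split rule (words and whitespace runs kept as separate pieces) are all assumptions chosen so that decoding restores the input exactly.

```python
import re

# Toy frozen vocabulary; the entries and IDs are invented for illustration.
VOCAB = {"Perfectly": 0, " ": 1, "fine": 2}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Text -> token IDs. Whitespace runs are kept as pieces so nothing is lost."""
    pieces = re.findall(r"\s+|\S+", text)
    return [VOCAB[piece] for piece in pieces]  # raises KeyError for unknown pieces

def decode(ids: list[int]) -> str:
    """Token IDs -> text, using the same table in the other direction."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("Perfectly fine")
print(ids)          # [0, 1, 2]
print(decode(ids))  # Perfectly fine
```

Everything the model sees sits between `encode` and `decode`: it receives `[0, 1, 2]` and emits more integers, never the strings themselves.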

Three families of tokenizer

The part that actually differs between LLMs is the splitting algorithm, that is, how text becomes tokens. There are three broad approaches: word-level, character-level, and subword-level.

Word-level and character-level tokenizers are rarely used to train modern frontier models. Almost every current LLM uses subword-level tokenization — the other two have drawbacks that make them a poor fit at scale.

Word-level

A word-level tokenizer splits text on whitespace and punctuation, so each token is a whole word. Perfectly fine becomes [Perfectly, fine], then maps to IDs like [52141, 7060].

It looks natural, but the vocabulary explodes. The internet contains an enormous number of distinct words — across languages, plus jargon, company and product names, personal names, URLs, typos, and code. The vocabulary balloons to hundreds of thousands or even millions of entries:

| Token | ID |
| --- | --- |
| a | 0 |
| about | 1 |
| after | 2 |
| zebra | 270030 |
| ! | 270131 |

A huge vocabulary inflates memory: the embedding matrix and output layer must cover every token. Worse is the out-of-vocabulary (OOV) problem: at inference the tokenizer can hit a word that never appeared in its training text, and it has no ID to represent it.
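The OOV problem is easy to reproduce with a toy word-level tokenizer. In this sketch the split regex, the training sentence, and the catch-all `UNK_ID` are assumptions; mapping unknown words to a single "unknown" ID is a common workaround, and it discards the word's identity entirely.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split on whitespace and punctuation (assumed rule for this sketch)."""
    return re.findall(r"\w+|[^\w\s]", text)

training_text = "the model reads tokens, not words."
# dict.fromkeys removes duplicates while keeping first-seen order.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(word_tokenize(training_text)))}
UNK_ID = len(vocab)  # hypothetical catch-all ID for anything outside the vocabulary

def encode(text: str) -> list[int]:
    return [vocab.get(tok, UNK_ID) for tok in word_tokenize(text)]

print(word_tokenize("Perfectly fine"))        # ['Perfectly', 'fine']
print(encode("the model reads tokenizers."))  # [0, 1, 2, 8, 7]
```

The word "tokenizers" was never in the training text, so it collapses to ID 8 and becomes indistinguishable from every other unknown word.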

Character-level

A character-level tokenizer splits text into individual characters. Perfectly fine becomes 14 tokens: [P, e, r, f, e, c, t, l, y, (space), f, i, n, e]. Each character — letter, digit, space, punctuation, or any Unicode symbol — gets an ID.

| Token | ID |
| --- | --- |
| a | 0 |
| b | 1 |
| A | 26 |
| ! | 57 |
| &lt;SPACE&gt; | 105 |

The vocabulary is tiny and there's no OOV problem. But sequences get very long — 14 tokens where word-level used 2. Long sequences mean far more computation, since the Transformer has to process and relate every token. The model also has to learn how characters combine into words and how those words carry meaning, which makes learning harder.
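The sequence-length gap is simple to check. The sketch below compares the two splits of the same phrase; the squared figures assume standard self-attention, where every token is related to every other token, and are meant only to show how the cost grows.

```python
text = "Perfectly fine"

char_tokens = list(text)    # one token per character, the space included
word_tokens = text.split()  # whitespace split, for comparison

# Character vocabulary: one ID per distinct character seen in the text.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(char_tokens), len(word_tokens))            # 14 2
print(len(char_vocab))                               # 11
print(len(char_tokens) ** 2, len(word_tokens) ** 2)  # 196 4
```

The vocabulary needs only 11 entries for this phrase, but the sequence is seven times longer and the number of token-to-token relations is 49 times larger.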

Subword-level

Subword-level tokenization is the compromise, and today's standard. Frequent words get their own token; rarer or unseen words are broken into smaller subword pieces. Perfectly fine might become [Perfect, ly, fine].

| Token | ID |
| --- | --- |
| the | 0 |
| of | 1 |
| home | 2 |
| ##ing | 50252 |
| ##ed | 50253 |
| ##able | 50254 |
| &lt;EOS&gt; | 50255 |
| &lt;SPACE&gt; | 50256 |

The ## prefix (a convention borrowed from WordPiece) marks a piece that continues the previous token, so walking → [walk, ##ing]. This gives a moderate vocabulary, manageable sequence lengths, and graceful handling of brand-new words.
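One way to split a word against a subword vocabulary is greedy longest-match, the approach WordPiece-style tokenizers use. The sketch below assumes a tiny hand-written vocabulary and an `[UNK]` fallback; it illustrates the ## convention and is not the splitting rule of any particular model.

```python
def subword_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match split; pieces after the first carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary for illustration.
vocab = {"walk", "talk", "home", "##ing", "##ed", "##able"}

print(subword_tokenize("walking", vocab))   # ['walk', '##ing']
print(subword_tokenize("talked", vocab))    # ['talk', '##ed']
print(subword_tokenize("walkable", vocab))  # ['walk', '##able']
print(subword_tokenize("home", vocab))      # ['home']
```

"walkable" is not in the vocabulary as a whole word, yet it is still representable from two existing pieces.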

| Approach | Vocabulary size | Sequence length | Unknown words | Used for frontier models |
| --- | --- | --- | --- | --- |
| Word-level | Very large (100K–millions) | Short | Breaks (OOV) | Rarely |
| Character-level | Tiny (~hundreds) | Very long | No OOV, but no word sense | Rarely |
| Subword-level | Moderate (50K–200K) | Moderate | Composed from subwords | Standard |

Modern LLMs typically use vocabularies between 50,000 and 200,000 tokens: GPT-2 used ~50K, GPT-4's cl100k_base ~100K, and GPT-4o's o200k_base ~200K. It's a deliberate trade-off between vocabulary size and sequence length.
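To put rough numbers on the vocabulary side of that trade-off: the embedding matrix holds one vector per token, so its parameter count is vocabulary size times embedding width. The width of 4096 below is an assumed value for illustration, not a figure for any specific model, and the round vocabulary sizes are approximations.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

D_MODEL = 4096  # assumed embedding width, for illustration only

for label, vocab_size in [
    ("subword ~50K", 50_000),
    ("subword ~100K", 100_000),
    ("subword ~200K", 200_000),
    ("word-level 1M", 1_000_000),
]:
    millions = embedding_params(vocab_size, D_MODEL) / 1e6
    print(f"{label:>14}: {millions:,.0f}M embedding parameters")
```

If the output layer does not share weights with the embedding matrix, it adds a second matrix of the same size, so the cost of a larger vocabulary is paid twice.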

Byte-Pair Encoding, step by step

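The most popular algorithm for building a subword vocabulary is Byte-Pair Encoding (BPE). (WordPiece and the Unigram model are close relatives; SentencePiece is a library that implements both BPE and Unigram.) BPE was originally a text-compression algorithm; it was later adapted to split words for neural machine translation, and OpenAI adopted it for tokenizing GPT.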

The key idea: start from individual characters, then repeatedly merge the most frequent adjacent pair into a new, larger token.

Say the training data contains these words:

low
lower
lowest

BPE first represents each word as a sequence of characters:

low     → l o w
lower   → l o w e r
lowest  → l o w e s t

It then scans the whole corpus and counts how often each adjacent pair occurs. The goal isn't to find whole words but to find patterns that repeat often. The pairs (l, o) and (o, w) each appear in all three words; ties are broken by a fixed rule, and here (l, o) is picked first and merged into a new token lo:

lo w
lo w e r
lo w e s t

It recounts pairs and merges again. If (lo, w) is now the most frequent pair, it becomes low:

low
low e r
low e s t

This repeats over huge amounts of text until the vocabulary reaches its target size. Each merge adds exactly one token, so a vocabulary of 50K to 200K tokens takes tens of thousands, sometimes hundreds of thousands, of merges. Very frequent patterns earn their own tokens. Sometimes those are whole words (low, house, computer); sometimes they're word-parts (ing, tion, ment, able).
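The count-and-merge loop above can be sketched as follows. This is a minimal character-level version on the three-word example, assuming ties go to the pair seen first; production tokenizers typically work on bytes, weight words by frequency, and are heavily optimized.

```python
from collections import Counter

def count_pairs(words: list[list[str]]) -> Counter:
    """Count every adjacent symbol pair across all words."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts

def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus_words: list[str], num_merges: int):
    """Return the learned merge list and the corpus after applying it."""
    words = [list(word) for word in corpus_words]  # start from characters
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break  # every word is already a single token
        # Assumed tie-break: max() returns the first pair with the top count.
        best = max(counts, key=counts.get)
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words

merges, words = train_bpe(["low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

The ordered `merges` list is the real output of training: together with the base characters it defines the vocabulary, and it is what gets replayed later on new text.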

Why start from characters?

It seems odd that BPE starts from characters when character-level tokenization has the long-sequence problem. But characters are only the starting point for building the vocabulary — BPE's whole job is to merge them into larger, useful units. Take programming:

Character-level:  p r o g r a m m i n g     (11 tokens)
BPE:              program ming               (2 tokens)

If programming is common enough, BPE may even keep it as a single token. So text that character-level tokenization would shatter into thousands of tokens, BPE often represents in far fewer. And a brand-new word the tokenizer never saw can still be built from existing subword pieces, for example running → [run, ning], so there's no hard OOV wall.

Shorter sequences mean less memory, less computation, and more efficient training and inference. That's why BPE became the sweet spot between word-level and character-level approaches, and why most modern LLMs use some variant of subword tokenization.

BPE merges are learned at build time. The pair-counting and merging described above runs while the tokenizer is created — not every time you send a message. At inference the learned merges are simply applied to your text.
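Applying learned merges at inference can be as simple as replaying them in order on each new word. The merge list below is hypothetical: the first two entries match the low/lower/lowest example, and ("e", "r") is added only for illustration. Real implementations usually rank merges and repeatedly apply the best-ranked pair present, which is faster.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying learned merges in the order they were learned."""
    symbols = list(word)  # start from characters, as in training
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical frozen merge list produced at build time.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lower", MERGES))   # ['low', 'er']
print(apply_merges("slower", MERGES))  # ['s', 'low', 'er']
print(apply_merges("lowest", MERGES))  # ['low', 'e', 's', 't']
```

No counting happens here. "slower" never appeared in the training words, yet it is still encoded from pieces the merge list already knows.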

Don't build your own

You rarely need to build a tokenizer from scratch. Mature open-source tokenizers exist — the best known is OpenAI's tiktoken, a BPE tokenizer you can drop into your own training or inference pipeline.

To see tokenization happen live across different models, try the Tiktokenizer playground — paste text and watch how each tokenizer splits it and assigns IDs.

Related: Tokenization · How LLMs Are Built
