The same byte-pair encoding idea every textbook describes — but written the way GPT-2, tiktoken, and HuggingFace's tokenizers actually implement it: incremental pair counts, a reverse index from pairs to words, and a priority-queue encoder.
BPE training is conceptually four lines: count adjacent symbol pairs, merge the most frequent one, repeat until the vocabulary is full. A textbook implementation following that recipe literally is hundreds of times too slow to train on a real corpus. This page walks through the optimization that makes BPE production-grade — keeping pair counts incremental so a merge only touches the words it actually changes — and then does the same for inference-time encoding.
A trained BPE tokenizer is two artifacts:
Training learns the merge table. Encoding replays it on new text. We tackle them in that order.
| Symbol | Description |
|---|---|
| \(V\) | Target vocabulary size |
| \(N\) | Number of distinct words (pre-tokens) in the corpus |
| \(L\) | Average length of a word in symbols |
| \(c_w\) | Frequency of word \(w\) in the corpus |
| \(P\) | Number of distinct adjacent pairs currently in play |
| \((A,B) \to AB\) | A merge: replace the adjacent pair \(AB\) by a new symbol |
Start each word as a sequence of bytes. Aggregate identical words with their counts so a 100-million-token corpus becomes a few million unique (word, count) pairs. Then loop:
# Naive BPE training
words = {w: (bytes(w), count) for w, count in word_counts.items()}
vocab = set(initial_bytes)
merges = []
while len(vocab) < V:
# (a) recount every adjacent pair across the whole corpus
pair_counts = defaultdict(int)
for w, (syms, c) in words.items():
for a, b in zip(syms, syms[1:]):
pair_counts[(a, b)] += c
# (b) pick the most frequent pair
(A, B), _ = max(pair_counts.items(), key=lambda kv: kv[1])
AB = A + B
vocab.add(AB)
merges.append((A, B))
# (c) rewrite every word, replacing (A,B) with AB everywhere
for w in words:
syms, c = words[w]
words[w] = (replace_pair(syms, A, B, AB), c)
That is the recipe in the original Sennrich et al. paper, written almost verbatim. It is correct. It is also unusable on real corpora.
Per merge, step (a) walks every adjacent pair in every word: \(\mathcal{O}(N L)\). Step (c) rewrites every word: another \(\mathcal{O}(N L)\). We do \(V\) of these. The total is:
\[ T_{\text{naive}} \;=\; \mathcal{O}\!\big(V \cdot N \cdot L\big) \]For a realistic run — \(V = 100{,}000\), \(N \approx 10^6\) unique words, \(L \approx 5\) symbols on average — that is \(5 \times 10^{11}\) pair-touches, hours to days on a single machine. The waste is obvious if you stare at the loop: every iteration recounts pairs in words the merge never touched. Most words don't contain \((A,B)\). Their pair counts don't change. We shouldn't be visiting them at all.
Two extra data structures, maintained as a side effect of every merge:
| Structure | Type | What it stores |
|---|---|---|
pair_counts | dict (s1,s2) → int | Current global count of each adjacent pair, weighted by word frequency. |
pair_to_words | dict (s1,s2) → set[word_id] | Reverse index: which words currently contain this pair at least once. |
Both are built once at startup in a single pass over the corpus. After that they are never rebuilt — they are patched as merges happen. The training loop becomes:
# Production BPE training (sketch)
build_pair_counts_and_index(words) # one-shot O(N·L)
while len(vocab) < V:
(A, B) = argmax(pair_counts) # or pop from max-heap
AB = new_symbol(A, B)
vocab.add(AB); merges.append((A, B))
for w in pair_to_words[(A, B)]: # ONLY the affected words
apply_merge_in_place(w, A, B, AB) # patches pair_counts + pair_to_words
del pair_counts[(A, B)]
del pair_to_words[(A, B)]
Two things to notice. First, the inner for ranges over only the words that actually contain the merge target — typically a small fraction of \(N\), shrinking further as the vocab grows. Second, apply_merge_in_place doesn't just rewrite the word; it surgically updates pair_counts and pair_to_words so the next iteration's argmax is already correct. There is no separate recount step.
For argmax(pair_counts) over hundreds of thousands of pairs, a linear scan each iteration is the next bottleneck. Production implementations swap it for a max-heap keyed on pair count, with lazy deletion: when we pop a pair whose stored count doesn't match the current pair_counts value, it's a stale entry — discard and pop the next. Stale entries are cheap; they replace the cost of keeping the heap perfectly synchronized.
This is the only subtle part of the algorithm. When we merge \((A, B) \to AB\) inside a single word's symbol sequence, the only pair counts that change are the ones touching the merged position. Concretely, for each occurrence of \((A, B)\) at position \(i\) in a word with frequency \(c_w\), name the neighbors:
\[ \ldots,\; \underbrace{X}_{i-1},\; \underbrace{A}_{i},\; \underbrace{B}_{i+1},\; \underbrace{Y}_{i+2},\; \ldots \quad\longrightarrow\quad \ldots,\; X,\; \underbrace{AB}_{i},\; Y,\; \ldots \]Three pairs disappear and two new pairs appear, each accounting for \(c_w\) of count:
| Old pair | Count change | New pair | Count change |
|---|---|---|---|
| \((X, A)\) | \(-c_w\) | \((X, AB)\) | \(+c_w\) |
| \((A, B)\) | \(-c_w\) | — (consumed) | — |
| \((B, Y)\) | \(-c_w\) | \((AB, Y)\) | \(+c_w\) |
That's the whole update. Whenever a pair's count drops, also remove the word from pair_to_words[old_pair] if it no longer contains that pair; whenever a new pair appears in this word, add the word to pair_to_words[new_pair]. Edge cases:
A B A B A B, the pair \((A, B)\) appears three times but they overlap with \((B, A)\) at the seams. Scan left-to-right and after each merge skip past the newly created \(AB\) so you don't try to re-merge inside it.A A A being merged on \((A, A)\) becomes AA A (not A AA — left-to-right is the convention) and the second \((A, A)\) is gone, not pending.In practice this is small and shrinks as training progresses — common pairs get merged early when they appear in many words, but once a pair is rare enough to be picked late it only lives in a handful of words. Empirically the whole training run is closer to \(\mathcal{O}(V \log V + \text{corpus size})\) than to the naive \(\mathcal{O}(V \cdot N \cdot L)\).
A toy corpus of three words. Step through one merge and watch the algorithm visit only the words it has to.
The merges in lower and newer follow the seam rule from §5: X=w, no Y (the r is the last symbol), so \((w, e)\) and \((e, r)\) shed \(c_w\) each and \((w, er)\) gains \(c_w\). Tie-break at the first step picks \((e, r)\) over \((w, e)\) by lexicographic order — a common convention.
Notice nothing happened to \((l,o)\), \((o,w)\), or \((n,e)\): the words containing those pairs were never visited. That is the win the reverse index buys.
Training gives us the merge table. Encoding a new piece of text means: pre-tokenize, convert each piece to bytes, then replay the merges in priority order until no more apply. Priority is the merge's rank in the table — rank 0 (the first merge learned, the most common pair in the corpus) wins over rank 100.
The naive encoder is \(\mathcal{O}(L^2)\) per piece: at each step, scan all adjacent pairs, find the one with the lowest rank, merge it, repeat. Fine for short pieces; awful when the regex pre-tokenizer hands you a long URL.
The production encoder uses a doubly-linked list of symbols plus a min-heap keyed on (rank, position):
# Production BPE encoding (sketch)
syms = doubly_linked_list(bytes_of(piece))
heap = []
for node in syms:
rank = merge_rank.get((node.val, node.next.val))
if rank is not None:
heappush(heap, (rank, node.id, node))
while heap:
rank, _, node = heappop(heap)
if node.deleted or node.next.deleted:
continue # lazy: stale entry
if (node.val, node.next.val) != merge_pair[rank]:
continue # neighbours changed under us
merge(node, node.next) # node now holds AB; node.next is unlinked
# the only new pairs are at the two seams
for seam in (node.prev, node):
if seam and seam.next:
r = merge_rank.get((seam.val, seam.next.val))
if r is not None:
heappush(heap, (r, seam.id, seam))
The same pattern as training: when something changes, only update what touches the change. Each symbol is merged at most \(\mathcal{O}(\log L)\) times into successively bigger tokens, so the encoder runs in \(\mathcal{O}(L \log L)\) per piece. Pre-tokenizing into short pieces first means \(L\) is small anyway — usually a single word.
Before BPE ever runs, both training and encoding split text on a regex that forces certain boundaries — between letters and digits, between words and punctuation, and (crucially) between leading whitespace and the next word. GPT-2's regex is the canonical one:
's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
BPE then operates inside each piece. Without pre-tokenization, BPE would happily learn merges that span word boundaries (" the" + " of" as one token) which destroys generalization and bloats the vocab with redundant whitespace-prefixed variants. The regex is what makes " world" and "world" two clean alternatives instead of arbitrary surface forms.
Starting from raw bytes (256 initial symbols) guarantees the tokenizer can encode anything — no <UNK>, no out-of-vocabulary character, no SentencePiece-style fallback to character-level for rare Unicode. The cost is that one emoji might span four tokens before any merges are learned for it; the benefit is that the tokenizer is total: every byte string has exactly one encoding.
Tokens like <|endoftext|> are not learnable merges. They are reserved IDs added to the vocabulary at the start, and the encoder splits the input on their literal strings before pre-tokenization, so they survive intact. This is why injecting <|endoftext|> from user-controlled text is a real security concern — the tokenizer treats it as a single control token regardless of where it appears.
Vocabulary + ordered merge table. Order = priority.
\(\mathcal{O}(V \cdot N \cdot L)\) — recounts everything each step.
Merging \((A,B)\) only changes pair counts at the seams \((X,A)\), \((A,B)\), \((B,Y)\).
Pair → set of words containing it, so a merge visits only affected words.
Lazy-deletion heap on pair counts replaces the per-step argmax scan.
Linked list + min-heap on merge rank → \(\mathcal{O}(L \log L)\) per piece.
Regex split first so merges never cross word boundaries.
256 initial symbols → no UNK, every byte string encodable.
| When you see… | Recognize it as |
|---|---|
| "BPE retrains in minutes, not hours" | Incremental pair counts + reverse index |
A merges.txt file with one pair per line | The ordered merge table — line number = rank = priority |
| A heap entry that gets popped and skipped | Lazy deletion: a count or pair changed since it was queued |
| The GPT-2 regex in a tokenizer's source | The pre-tokenization step that runs before BPE |
| An emoji that became four tokens | No merges learned for it yet — byte-level fallback in action |