Production BPE — Efficient Training & Encoding

BPE training is conceptually four lines: count adjacent symbol pairs, merge the most frequent one, repeat until the vocabulary is full. A textbook implementation following that recipe literally is hundreds of times too slow to train on a real corpus. This page walks through the optimization that makes BPE production-grade — keeping pair counts incremental so a merge only touches the words it actually changes — and then does the same for inference-time encoding.

1. What BPE produces

A trained BPE tokenizer is two artifacts:

Vocabulary — a list of token IDs. For byte-level BPE this starts as the 256 single bytes (plus any special tokens) and grows by one each merge.
Merge table — an ordered list of merges \((A, B) \to AB\). Order matters: earlier merges have higher priority at encoding time, because they were the most frequent pairs in the training corpus.

Training learns the merge table. Encoding replays it on new text. We tackle them in that order.

Symbol	Description
\(V\)	Target vocabulary size
\(N\)	Number of distinct words (pre-tokens) in the corpus
\(L\)	Average length of a word in symbols
\(c_w\)	Frequency of word \(w\) in the corpus
\(P\)	Number of distinct adjacent pairs currently in play
\((A,B) \to AB\)	A merge: replace the adjacent pair \(AB\) by a new symbol

2. The naive training algorithm

Start each word as a sequence of bytes. Aggregate identical words with their counts so a 100-million-token corpus becomes a few million unique (word, count) pairs. Then loop:

# Naive BPE training
words = {w: (bytes(w), count) for w, count in word_counts.items()}
vocab = set(initial_bytes)
merges = []

while len(vocab) < V:
    # (a) recount every adjacent pair across the whole corpus
    pair_counts = defaultdict(int)
    for w, (syms, c) in words.items():
        for a, b in zip(syms, syms[1:]):
            pair_counts[(a, b)] += c

    # (b) pick the most frequent pair
    (A, B), _ = max(pair_counts.items(), key=lambda kv: kv[1])
    AB = A + B
    vocab.add(AB)
    merges.append((A, B))

    # (c) rewrite every word, replacing (A,B) with AB everywhere
    for w in words:
        syms, c = words[w]
        words[w] = (replace_pair(syms, A, B, AB), c)

That is the recipe in the original Sennrich et al. paper, written almost verbatim. It is correct. It is also unusable on real corpora.

3. Why naive BPE is too slow

Per merge, step (a) walks every adjacent pair in every word: \(\mathcal{O}(N L)\). Step (c) rewrites every word: another \(\mathcal{O}(N L)\). We do \(V\) of these. The total is:

\[ T_{\text{naive}} \;=\; \mathcal{O}\!\big(V \cdot N \cdot L\big) \]

For a realistic run — \(V = 100{,}000\), \(N \approx 10^6\) unique words, \(L \approx 5\) symbols on average — that is \(5 \times 10^{11}\) pair-touches, hours to days on a single machine. The waste is obvious if you stare at the loop: every iteration recounts pairs in words the merge never touched. Most words don't contain \((A,B)\). Their pair counts don't change. We shouldn't be visiting them at all.

The dominant cost is step (a) — recounting from scratch. Even step (c) can be made fast if we know which words contain the pair we're merging. The whole production algorithm follows from making that "which words" lookup cheap.

4. The production algorithm: incremental counts + reverse index

Two extra data structures, maintained as a side effect of every merge:

Structure	Type	What it stores
`pair_counts`	dict `(s1,s2) → int`	Current global count of each adjacent pair, weighted by word frequency.
`pair_to_words`	dict `(s1,s2) → set[word_id]`	Reverse index: which words currently contain this pair at least once.

Both are built once at startup in a single pass over the corpus. After that they are never rebuilt — they are patched as merges happen. The training loop becomes:

# Production BPE training (sketch)
build_pair_counts_and_index(words)   # one-shot O(N·L)

while len(vocab) < V:
    (A, B) = argmax(pair_counts)                 # or pop from max-heap
    AB = new_symbol(A, B)
    vocab.add(AB); merges.append((A, B))

    for w in pair_to_words[(A, B)]:        # ONLY the affected words
        apply_merge_in_place(w, A, B, AB)        # patches pair_counts + pair_to_words

    del pair_counts[(A, B)]
    del pair_to_words[(A, B)]

Two things to notice. First, the inner for ranges over only the words that actually contain the merge target — typically a small fraction of \(N\), shrinking further as the vocab grows. Second, apply_merge_in_place doesn't just rewrite the word; it surgically updates pair_counts and pair_to_words so the next iteration's argmax is already correct. There is no separate recount step.

For argmax(pair_counts) over hundreds of thousands of pairs, a linear scan each iteration is the next bottleneck. Production implementations swap it for a max-heap keyed on pair count, with lazy deletion: when we pop a pair whose stored count doesn't match the current pair_counts value, it's a stale entry — discard and pop the next. Stale entries are cheap; they replace the cost of keeping the heap perfectly synchronized.

5. The merge-update rule

This is the only subtle part of the algorithm. When we merge \((A, B) \to AB\) inside a single word's symbol sequence, the only pair counts that change are the ones touching the merged position. Concretely, for each occurrence of \((A, B)\) at position \(i\) in a word with frequency \(c_w\), name the neighbors:

\[ \ldots,\; \underbrace{X}_{i-1},\; \underbrace{A}_{i},\; \underbrace{B}_{i+1},\; \underbrace{Y}_{i+2},\; \ldots \quad\longrightarrow\quad \ldots,\; X,\; \underbrace{AB}_{i},\; Y,\; \ldots \]

Three pairs disappear and two new pairs appear, each accounting for \(c_w\) of count:

Old pair	Count change	New pair	Count change
\((X, A)\)	\(-c_w\)	\((X, AB)\)	\(+c_w\)
\((A, B)\)	\(-c_w\)	— (consumed)	—
\((B, Y)\)	\(-c_w\)	\((AB, Y)\)	\(+c_w\)

That's the whole update. Whenever a pair's count drops, also remove the word from pair_to_words[old_pair] if it no longer contains that pair; whenever a new pair appears in this word, add the word to pair_to_words[new_pair]. Edge cases:

Word boundary. If \(A\) is the first symbol, there is no \(X\): skip the \((X, A) \to (X, AB)\) line. Same for \(B\) at the end.
Overlapping occurrences. In a word like A B A B A B, the pair \((A, B)\) appears three times but they overlap with \((B, A)\) at the seams. Scan left-to-right and after each merge skip past the newly created \(AB\) so you don't try to re-merge inside it.
Adjacent merges of the same pair. A word like A A A being merged on \((A, A)\) becomes AA A (not A AA — left-to-right is the convention) and the second \((A, A)\) is gone, not pending.

Per-merge cost

\[ \mathcal{O}\!\big(\,|\text{words containing }(A,B)| \;\times\; \text{occurrences per word}\,\big) \]

In practice this is small and shrinks as training progresses — common pairs get merged early when they appear in many words, but once a pair is rare enough to be picked late it only lives in a handful of words. Empirically the whole training run is closer to \(\mathcal{O}(V \log V + \text{corpus size})\) than to the naive \(\mathcal{O}(V \cdot N \cdot L)\).

6. Worked example

A toy corpus of three words. Step through one merge and watch the algorithm visit only the words it has to.

▸ One BPE merge step on a toy corpus Pair counts update live as the merge ripples through.

low low ×5

lower lower ×2

newer newer ×6

Pair counts

(l, o)7

(o, w)7

(w, e)8

(e, r)8

(n, e)6

Step 0/5 — Initial state. Find the max pair.

The merges in lower and newer follow the seam rule from §5: X=w, no Y (the r is the last symbol), so \((w, e)\) and \((e, r)\) shed \(c_w\) each and \((w, er)\) gains \(c_w\). Tie-break at the first step picks \((e, r)\) over \((w, e)\) by lexicographic order — a common convention.

Notice nothing happened to \((l,o)\), \((o,w)\), or \((n,e)\): the words containing those pairs were never visited. That is the win the reverse index buys.

7. Encoding new text

Training gives us the merge table. Encoding a new piece of text means: pre-tokenize, convert each piece to bytes, then replay the merges in priority order until no more apply. Priority is the merge's rank in the table — rank 0 (the first merge learned, the most common pair in the corpus) wins over rank 100.

The naive encoder is \(\mathcal{O}(L^2)\) per piece: at each step, scan all adjacent pairs, find the one with the lowest rank, merge it, repeat. Fine for short pieces; awful when the regex pre-tokenizer hands you a long URL.

The production encoder uses a doubly-linked list of symbols plus a min-heap keyed on (rank, position):

# Production BPE encoding (sketch)
syms = doubly_linked_list(bytes_of(piece))
heap = []
for node in syms:
    rank = merge_rank.get((node.val, node.next.val))
    if rank is not None:
        heappush(heap, (rank, node.id, node))

while heap:
    rank, _, node = heappop(heap)
    if node.deleted or node.next.deleted:
        continue                              # lazy: stale entry
    if (node.val, node.next.val) != merge_pair[rank]:
        continue                              # neighbours changed under us

    merge(node, node.next)                    # node now holds AB; node.next is unlinked

    # the only new pairs are at the two seams
    for seam in (node.prev, node):
        if seam and seam.next:
            r = merge_rank.get((seam.val, seam.next.val))
            if r is not None:
                heappush(heap, (r, seam.id, seam))

The same pattern as training: when something changes, only update what touches the change. Each symbol is merged at most \(\mathcal{O}(\log L)\) times into successively bigger tokens, so the encoder runs in \(\mathcal{O}(L \log L)\) per piece. Pre-tokenizing into short pieces first means \(L\) is small anyway — usually a single word.

Tiktoken's hot path goes further: for each pre-token it materializes the byte sequence in an array and runs the merges with a tight Rust loop, skipping the linked list entirely because the pieces are short. The asymptotic story is the same; the constants are what tiktoken optimizes.

8. Pre-tokenization, byte fallback, special tokens

Pre-tokenization

Before BPE ever runs, both training and encoding split text on a regex that forces certain boundaries — between letters and digits, between words and punctuation, and (crucially) between leading whitespace and the next word. GPT-2's regex is the canonical one:

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

BPE then operates inside each piece. Without pre-tokenization, BPE would happily learn merges that span word boundaries (" the" + " of" as one token) which destroys generalization and bloats the vocab with redundant whitespace-prefixed variants. The regex is what makes " world" and "world" two clean alternatives instead of arbitrary surface forms.

Byte-level alphabet

Starting from raw bytes (256 initial symbols) guarantees the tokenizer can encode anything — no <UNK>, no out-of-vocabulary character, no SentencePiece-style fallback to character-level for rare Unicode. The cost is that one emoji might span four tokens before any merges are learned for it; the benefit is that the tokenizer is total: every byte string has exactly one encoding.

Special tokens

Tokens like <|endoftext|> are not learnable merges. They are reserved IDs added to the vocabulary at the start, and the encoder splits the input on their literal strings before pre-tokenization, so they survive intact. This is why injecting <|endoftext|> from user-controlled text is a real security concern — the tokenizer treats it as a single control token regardless of where it appears.

9. Summary

Outputs

Vocabulary + ordered merge table. Order = priority.

Naive cost

\(\mathcal{O}(V \cdot N \cdot L)\) — recounts everything each step.

Key insight

Merging \((A,B)\) only changes pair counts at the seams \((X,A)\), \((A,B)\), \((B,Y)\).

Reverse index

Pair → set of words containing it, so a merge visits only affected words.

Max-heap

Lazy-deletion heap on pair counts replaces the per-step argmax scan.

Encoding

Linked list + min-heap on merge rank → \(\mathcal{O}(L \log L)\) per piece.

Pre-tokenize

Regex split first so merges never cross word boundaries.

Byte-level

256 initial symbols → no UNK, every byte string encodable.

When you see…	Recognize it as
"BPE retrains in minutes, not hours"	Incremental pair counts + reverse index
A `merges.txt` file with one pair per line	The ordered merge table — line number = rank = priority
A heap entry that gets popped and skipped	Lazy deletion: a count or pair changed since it was queued
The GPT-2 regex in a tokenizer's source	The pre-tokenization step that runs before BPE
An emoji that became four tokens	No merges learned for it yet — byte-level fallback in action