Models can't read text. They process numbers. A tokenizer is the bridge — it converts text → integer IDs, and back.
Text: "Hello world"
Tokens: ["Hello", " world"] ← human-readable view
IDs: [9906, 1917] ← what the model actually sees
Without a tokenizer, the model couldn't process language at all. Every large language model — GPT, Claude, Llama, Qwen — has one.
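A minimal sketch of that round trip, assuming a two-entry toy vocabulary that reuses the IDs from the example above. The `encode` and `decode` helpers are hypothetical, and the greedy longest-match lookup is a simplification: a real BPE encoder applies its learned merges in order.

```python
# Toy vocabulary: IDs copied from the "Hello world" example; everything else is illustrative.
VOCAB = {"Hello": 9906, " world": 1917}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text):
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids

def decode(ids):
    """Map IDs back to their strings and concatenate."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("Hello world")
print(ids)
print(decode(ids))
```

Note that the space belongs to the second token: decoding is plain concatenation, with no separator to re-insert.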
Why "tokens" and not letters or words?
You could split text three ways.
By character. "Hello" → H, e, l, l, o (five tokens). The vocabulary is tiny — just letters — but the model has to learn from scratch that the sequence H, e, l, l, o means hello. That wastes capacity on what is essentially a separate spelling task. Bad efficiency.
By word. "Hello" → Hello (one token). Efficient when it works, but the vocabulary explodes: every misspelling, every name, every new word needs its own ID. Languages with heavy compounding and inflection, such as German or Finnish, blow it apart.
By subword (byte-pair encoding, BPE). "unforgivable" → un, forgiv, able (three tokens). The vocabulary stays a reasonable size (for us, around 200,000 entries) but covers any text in any language. Rare words decompose into common subparts. Common words are still single tokens. This is what we do, and what almost every modern model does.
The downside: training the tokenizer is expensive, and the result is locked forever once a model is trained on top of it.
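To make the character-versus-word trade-off concrete, here is a small sketch that counts sequence length and vocabulary size under both schemes. The corpus string and the `split_stats` helper are made up for illustration, and whitespace splitting is only a stand-in for real word segmentation.

```python
def split_stats(corpus):
    """Return (sequence length, vocabulary size) for character and word splitting."""
    char_tokens = list(corpus)    # every character, spaces included
    word_tokens = corpus.split()  # naive whitespace split
    return {
        "char": (len(char_tokens), len(set(char_tokens))),
        "word": (len(word_tokens), len(set(word_tokens))),
    }

corpus = "the cat sits and the cats sit and the catlike cat sat"
for scheme, (length, vocab) in split_stats(corpus).items():
    print(f"{scheme}: {length} tokens, {vocab} vocabulary entries")
```

Character splitting gives a long sequence over a tiny vocabulary. Word splitting gives a short sequence, but cats, catlike, and every other variant each need their own entry.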
How BPE builds itself
It's a frequency game.
- Start with every letter or byte as its own token. Roughly 256 entries to begin with.
- Scan the training corpus. Find the most common adjacent pair, say `("e", "r")`, which appears millions of times.
- Merge them into a single token: `er`. Add it to the vocabulary.
- Repeat. Soon `er` and `ing` exist. Then `ering`. Then `engineer`. Then `engineering`.
- Stop when the vocabulary reaches the target size. Two hundred thousand, in our case.
What gets merged depends entirely on the training corpus. Train on English-only data and the tokenizer becomes English-shaped — it treats other languages as random byte sequences. Train on a balanced multilingual corpus and the vocabulary learns to cover them.
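The merge loop can be sketched in a few lines. This is a toy under stated assumptions: it starts from characters rather than 256 bytes, trains on a short word list rather than a corpus, and the function name `train_bpe` and its tie-breaking rule are illustrative choices, not what any particular library does.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

print(train_bpe(["lower", "newer", "wider", "er"], num_merges=3))
```

Swap the word list and the merges change with it, which is the corpus dependence described above in miniature.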
Why "fertility" matters
Fertility is how many tokens it takes to encode one character of a language.
English: "The cat sits" → 3 tokens → 0.25 tokens/char (efficient)
Swedish: "Katten sitter" → 4 tokens → 0.31 tokens/char (efficient)
Chinese: "猫坐着" → 9 tokens → 3.0 tokens/char (inefficient)
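As a rough sketch, fertility under this definition is a single division. The helper name `fertility` is hypothetical, the token counts are taken from the three examples above rather than produced by a real tokenizer, and characters are counted as Unicode code points via Python's `len`.

```python
def fertility(text, num_tokens):
    """Tokens per character, counting characters as Unicode code points."""
    return num_tokens / len(text)

# (text, token count) pairs copied from the examples above.
examples = {
    "English": ("The cat sits", 3),
    "Swedish": ("Katten sitter", 4),
    "Chinese": ("猫坐着", 9),
}
for language, (text, num_tokens) in examples.items():
    print(f"{language}: {fertility(text, num_tokens):.2f} tokens/char")
```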
Why is high fertility bad? Two reasons.
Context window loss. A model with a 4,096-token context can read ~16,000 characters of English but only ~1,365 characters of Chinese at three tokens per character. In this example the Chinese user fits about a twelfth as many characters into the same window, for the same prompt, the same task, the same prose.
Training inefficiency. Each gradient step processes one batch of tokens. High-fertility languages need three to four times more compute to learn the same meaning. They train slower, and they train worse, for the same budget.
A tokenizer with uneven fertility across the languages it claims to serve is, quietly, an unfair tokenizer.
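The context-window arithmetic can be checked directly. A small sketch, assuming the 4,096-token window and the example fertilities quoted above; `chars_that_fit` is a hypothetical helper name.

```python
def chars_that_fit(context_tokens, tokens_per_char):
    """How many whole characters fit in a context window at a given fertility."""
    return int(context_tokens / tokens_per_char)

CONTEXT_TOKENS = 4096
print("English:", chars_that_fit(CONTEXT_TOKENS, 0.25))
print("Chinese:", chars_that_fit(CONTEXT_TOKENS, 3.0))
```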
Why it's "locked" once trained
Once you train a model with a tokenizer, every weight in the model is tuned to expect that exact tokenizer's IDs. Token #1862 means `The` to that model. Forever.
If you change the tokenizer afterwards:
- Token `#1862` might now mean `prefix`.
- Every weight in the model now expects the wrong meaning.
- Output becomes garbage.
So: train the tokenizer once, train the model on top of it, never change the tokenizer. That's why our internal doctrine says "locked after the merge of PR-A02." For Kapllan-K, we will train K-0 through K-4 — five generations of models — all on the same tokenizer. The decision we make today binds 22 months of model training that follows.
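A toy illustration of the failure mode, not a real model: a dictionary stands in for the trained weights, and the second ID (4051) and both vocabularies are invented for the example. Only the `#1862` → `The` pairing comes from the text above.

```python
# What the "model" learned during training, keyed by token ID (stand-in for weights).
learned_meaning = {1862: "the article 'The'", 4051: "the word 'prefix'"}

# Vocabulary the model was trained with, and a hypothetical retrained replacement
# that happens to hand the same IDs to different strings.
old_vocab = {"The": 1862, "prefix": 4051}
new_vocab = {"prefix": 1862, "The": 4051}

def model_reads(token, vocab):
    """What the frozen model believes it is seeing when `token` is encoded with `vocab`."""
    return learned_meaning[vocab[token]]

print(model_reads("The", old_vocab))  # consistent with training
print(model_reads("The", new_vocab))  # same text, wrong meaning
```

The model itself never changed. Only the mapping in front of it did, and that is enough to scramble every input.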
Why this is hard
Multilingual fairness sounds like a soft requirement until you try to engineer it. It demands three things at once:
- A balanced training corpus. Roughly equal share of each language, so BPE actually learns merges for each. Without this, low-resource languages just get byte-level fragments.
- The right preprocessing. Chinese needs word segmentation (we use jieba); Japanese needs morphological analysis (sudachi); Korean benefits from explicit Hangul syllable splits. Without these, BPE wastes vocabulary slots on byte-level rubble that means nothing to a reader.
- The right algorithm and library. SentencePiece is single-threaded at our scale and would take more than ten days. Hugging Face's `tokenizers` parallelizes properly and finishes in hours.
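For the first requirement, one simple way to get an equal share per language is to draw the same number of lines from each and upsample the small ones. This is a sketch of that idea only; the function name `balanced_sample`, the with-replacement upsampling, and the sample data are assumptions, not a description of our actual pipeline.

```python
import random
from collections import Counter

def balanced_sample(lines_by_lang, per_lang, seed=0):
    """Draw `per_lang` lines from each language; assumes every language has at least one line."""
    rng = random.Random(seed)
    sample = []
    for lang in sorted(lines_by_lang):
        lines = lines_by_lang[lang]
        if len(lines) >= per_lang:
            picked = rng.sample(lines, per_lang)  # without replacement
        else:
            # Low-resource language: upsample with replacement to reach the quota.
            picked = [rng.choice(lines) for _ in range(per_lang)]
        sample.extend((lang, line) for line in picked)
    rng.shuffle(sample)  # avoid language-sorted order in the training stream
    return sample

data = {
    "en": [f"english line {i}" for i in range(1000)],
    "sq": ["rresht një", "rresht dy", "rresht tre"],
}
print(Counter(lang for lang, _ in balanced_sample(data, per_lang=10)))
```

Upsampling repeats text, so a tiny language still contributes less variety than its share suggests. The quota fixes the pair counts BPE sees, not the underlying data shortage.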
Doing all three at once, on commodity hardware, is finicky. That is the journey we have been on for the last ten hours.
What v2 specifically fixes
| | v1 (committed, FAIL) | v2 (training now) |
|---|---|---|
| Algorithm | BPE | BPE |
| Vocabulary size | 200,000 | 200,000 |
| Chinese preprocessing | none (raw bytes) | jieba segmentation |
| Japanese preprocessing | none | sudachi |
| Korean preprocessing | none | Hangul syllable split |
| CJK content in training | ~0.4% (under-sampled) | ~30% (full fineweb-2-zh/ja/ko) |
| Chinese fertility (tokens/char) | 0.71 (3.01× median, FAIL) | 0.30–0.40 expected (1.3–1.7× median, hopefully PASS) |
| Locks the tokenizer? | yes, if accepted | yes, if it PASSES |
If v2 passes the doctrine fairness gate, it becomes the locked Kapllan-K tokenizer for K-0 through K-4 — the 200,000-vocab BPE that every model in the program will use, unchanged, for as long as the program runs.
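A gate of this kind can be sketched as a ratio against the median fertility across languages. Everything here beyond the 0.71 Chinese figure is an assumption: the other fertility values are invented so that the median works out near the one implied by the table, and the `max_ratio=2.0` threshold is a placeholder, since the real gate threshold is not stated in this note.

```python
from statistics import median

def fairness_report(fertility_by_lang, max_ratio):
    """Map each language to (ratio to median fertility, passes the gate?)."""
    med = median(fertility_by_lang.values())
    return {
        lang: (value / med, value / med <= max_ratio)
        for lang, value in fertility_by_lang.items()
    }

# Only "zh" is a measured v1 number; the rest are made-up placeholders.
v1 = {"en": 0.21, "de": 0.23, "sv": 0.236, "sq": 0.25, "zh": 0.71}

for lang, (ratio, ok) in fairness_report(v1, max_ratio=2.0).items():
    print(f"{lang}: {ratio:.2f}x median -> {'PASS' if ok else 'FAIL'}")
```

Measuring against the median rather than the mean keeps one badly served language from dragging the baseline toward itself.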
That is why we are spending three hours and €5 to get this right. Cheap insurance for a foundation that has to last almost two years of model training.
Why we don't just use Llama's tokenizer
We considered it. We rejected it for three reasons.
- Llama's tokenizer doesn't reserve our 71 special tokens: `<plan>`, `<think>`, `<image>`, and the rest of our control vocabulary.
- Llama's tokenizer wasn't trained on our Nordic-boosted mix. Swedish, Norwegian, Danish, Finnish, and Albanian fertility would all be worse than what we can train ourselves.
- Using someone else's tokenizer means our future models can't differentiate on multilingual capability. We would inherit their compromises and ship them as our own.
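Reserving special tokens usually means giving them fixed IDs before any learned entry is added. A minimal sketch, assuming the specials go first in the vocabulary; that layout, the `build_vocab` helper, and the learned tokens shown are illustrative, and only three of the 71 control tokens are named in this note.

```python
# Three of the 71 control tokens named above; the remainder are not listed here.
SPECIAL_TOKENS = ["<plan>", "<think>", "<image>"]

def build_vocab(special_tokens, learned_tokens):
    """Assign IDs with special tokens first so their IDs never depend on the merges."""
    vocab = {}
    for token in list(special_tokens) + list(learned_tokens):
        if token not in vocab:  # skip duplicates, keep the first ID
            vocab[token] = len(vocab)
    return vocab

vocab = build_vocab(SPECIAL_TOKENS, ["e", "r", "er", "ing"])
print(vocab)
```

With a borrowed tokenizer those slots are already taken by someone else's entries, so control tokens would have to be bolted on afterwards.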
So we train our own. A custom tokenizer is a first-class regional differentiator, and the easiest place to bake one in is at the very beginning, before any weight has yet been written.
Currently v2 is at 600,000 of 2.7 million lines processed. Wakeup at 00:47 to check.