This is the deep dive companion to episode one of Actually, AI: tokens.
In the main episode, we said that Byte Pair Encoding finds the most common pair and merges it. That is accurate, but it hides how surprisingly simple the whole thing is. The algorithm that determines how every modern AI reads your text fits on a napkin. Here is how a tokenizer vocabulary gets built.
You start with the raw text from your training data, broken into individual bytes. Two hundred and fifty-six possible values. That is your base vocabulary. Now you count every adjacent pair. If the pair "t" followed by "h" appears more often than any other pair, you merge them into a new token, "th." You add "th" to your vocabulary. You go back through the entire training data, replace every occurrence of "t" followed by "h" with your new "th" token, and count all the pairs again. Next round, maybe "th" followed by "e" is the most common. Merge those into "the." Add it to the vocabulary. Replace all occurrences. Count again. Repeat.
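The napkin claim is testable. Here is a toy version of that training loop in Python, run on characters instead of raw bytes for readability. This is an illustration of the algorithm, not any production tokenizer: real implementations work on bytes, cache pair counts, and run tens of thousands of merges.

```python
from collections import Counter

def train_bpe(chars, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    tokens = list(chars)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]         # new token, e.g. "t" + "h" -> "th"
        # Replace every occurrence of the pair with the merged token.
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens

merges, encoded = train_bpe("the theory of the thing", num_merges=3)
# First merge is ("t", "h"), then ("th", "e"): exactly the example above.
```

Three merges on one sentence, and "the" has already become a single token.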
After fifty thousand rounds of this, you have a vocabulary of two hundred and fifty-six base bytes plus fifty thousand merged tokens plus a handful of special tokens. For GPT-2, the final count was fifty thousand two hundred and fifty-seven. That number is not arbitrary. Two hundred and fifty-six byte tokens, plus fifty thousand learned merges, plus one special end-of-text marker. Every model in the GPT family descends from this design choice.
The elegance is that common patterns compress naturally. "The" appears so often that it gets merged early and becomes a single token. "Quantum" appears less often and might stay as two tokens. "Xylophone" is rare enough that it might be four or five tokens. The vocabulary becomes an implicit frequency map of the training data. And here is the key insight that Sennrich brought to natural language processing in twenty fifteen: this is not just compression anymore. The merge boundaries happen to land on linguistically meaningful breaks. Prefixes like "un" and "re" often merge into their own tokens. Common suffixes like "ing" and "tion" do the same. The algorithm has no concept of morphology. It does not know what a prefix is. But because prefixes appear in consistent patterns across many words, the frequency statistics capture them automatically. The structure of language leaks through.
Byte Pair Encoding is not the only game in town. It is the dominant one, used by GPT, Llama, Mistral, and Gemma. But Google went a different direction with BERT, and the difference is instructive.
In twenty twelve, Mike Schuster and Kaisuke Nakajima at Google were working on Japanese and Korean voice search. Japanese does not use spaces between words, which makes splitting text into units fundamentally harder than in English. They developed WordPiece, an algorithm that looks similar to BPE but makes a subtly different choice at each step. Where BPE merges whichever pair appears most frequently, a greedy count, WordPiece merges whichever pair is most "informative." It calculates a score: how often does this pair appear together, divided by how often each piece appears separately? A pair that shows up together far more than chance would predict gets merged first, even if another pair has a higher raw count.
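The two selection rules can be put side by side. A toy comparison, with an invented string chosen to make them disagree: "a" pairs are frequent but promiscuous, while "q" and "z" appear rarely but only ever together, which is exactly what WordPiece's score rewards.

```python
from collections import Counter

def best_pair_bpe(tokens):
    """BPE: pick the adjacent pair with the highest raw count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def best_pair_wordpiece(tokens):
    """WordPiece: pick the pair maximizing count(a,b) / (count(a) * count(b))."""
    pairs = Counter(zip(tokens, tokens[1:]))
    units = Counter(tokens)
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))

tokens = list("aa aa qz")
# BPE sees ("a", "a") twice and grabs it. WordPiece notices that "q" and "z"
# never appear apart, so their pairing is maximally informative.
```

Same data, different first merge. Compounded over tens of thousands of merges, the vocabularies drift apart.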
The practical difference is subtle. BPE tends to merge the most common English words first. WordPiece tends to merge pairs that are more linguistically meaningful, pairs that represent real patterns rather than just high frequency. BERT and its descendants use WordPiece. Most of the generative models that power chatbots use BPE.
Then in twenty eighteen, Taku Kudo and John Richardson at Google published SentencePiece, which solved a different problem entirely. Both BPE and WordPiece assume the input text has already been split into words by whitespace. For English, that works. For Chinese, Japanese, and Thai, languages written without spaces, it does not work at all. SentencePiece treats the input as a raw stream of characters with no pre-splitting. It replaces spaces with a special marker character and runs the merging algorithm on the entire unbroken text. This makes it truly language-independent. T5 and Llama use SentencePiece. It can run either BPE or an alternative algorithm called Unigram under the hood.
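The space-marker trick is small enough to show directly. Real SentencePiece also normalizes the text and has an "add dummy prefix" option; this sketch shows only the core idea, that spaces become an ordinary visible character the merge algorithm can treat like any other.

```python
MARKER = "\u2581"  # U+2581 LOWER ONE EIGHTH BLOCK, the "▁" seen in SentencePiece vocabularies

def to_stream(text):
    """No word splitting: spaces become a marker so they survive tokenization."""
    return text.replace(" ", MARKER)

def from_stream(text):
    """Decoding is exact because the marker maps back to a space."""
    return text.replace(MARKER, " ")

stream = to_stream("hello world")   # "hello▁world", one unbroken stream
# The merging algorithm then runs over this stream with no notion of words,
# which is why the same code works for English, Japanese, or Thai.
```

Because the space is just another symbol, nothing in the pipeline ever needs a language-specific word splitter.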
Unigram, also by Kudo, works in the opposite direction from BPE. Instead of starting with individual characters and merging up, it starts with an enormous vocabulary and prunes down. At each step, it removes the tokens whose absence would increase the overall encoding cost the least. And unlike BPE, which always produces the same tokenization for a given string, Unigram is probabilistic. It can produce multiple different valid tokenizations for the same input and sample between them. This turns out to be useful during model training, because it forces the model to be robust to different ways of splitting the same text.
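The "multiple valid tokenizations" point is concrete enough to sketch. A toy unigram model, with invented probabilities: every segmentation of a string into known tokens is a candidate, and each is scored by the product of its token probabilities.

```python
import math

# Invented token probabilities standing in for what Unigram learns while pruning.
probs = {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
         "he": 0.1, "ll": 0.1, "hell": 0.15, "hello": 0.3}

def segmentations(s):
    """Every way to split s into tokens from the vocabulary."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        if s[:i] in probs:
            for rest in segmentations(s[i:]):
                yield [s[:i]] + rest

def score(seg):
    """Unigram log-likelihood: tokens are treated as independent draws."""
    return sum(math.log(probs[t]) for t in seg)

candidates = sorted(segmentations("hello"), key=score, reverse=True)
# Best is ["hello"], but ["hell", "o"] and ["he", "ll", "o"] are also valid,
# and training can sample the worse ones to make the model robust.
```

BPE would deterministically produce one of these. Unigram keeps the whole ranked list.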
The choice between these algorithms matters less than you might think. All of them produce subword vocabularies that look broadly similar. The real decisions that shape model behavior are the vocabulary size and, critically, what training data the tokenizer sees.
In January twenty twenty-three, Jessica Rumbelow and Matthew Watkins were working on AI safety research when they stumbled onto something they could not explain. Rumbelow was a PhD candidate at the University of St Andrews and a scholar at SERI-MATS, a mentorship program for AI safety researchers. Watkins was an honorary research fellow at Exeter University with a career in number theory. He plays the saz, a seven-stringed Turkish instrument. He spent the late nineteen nineties as a nomadic musician, busking and picking fruit and visiting megalithic sites across Europe. He wrote a trilogy of popular books about prime numbers. He was not, by any conventional measure, an AI researcher.
They were using a technique called k-means clustering on the embedding space of GPT models, looking for patterns in how the model organized its internal representations of tokens. Most clusters made sense. Tokens for numbers grouped together. Tokens for names grouped together. But some clusters were bizarre. They contained tokens that seemed to have no coherent meaning, tokens that the model treated as if they were radioactive.
When they fed these tokens to ChatGPT and asked it to simply repeat them, the model lost its mind. The token "SolidGoldMagikarp" caused the model to respond as if asked to repeat the word "distribute." "TheNitromeFan" came back as "one hundred and eighty-two." "guiActiveUn" became "reception." The model was not just failing to repeat these strings. It was hallucinating completely unrelated words, evading the question, generating insults, and in some cases producing strings of nonsense. They had found three hundred and seventy-four of these broken tokens. They called them glitch tokens.
As the researchers put it when the pieces fell into place: I have just found out that several of the anomalous tokens, TheNitromeFan, SolidGoldMagikarp, davidjl, Smartstocks, RandomRedditorWithNo, are handles of people who are, competitively or collaboratively, counting to infinity on a Reddit forum. I kid you not.
The subreddit is called r slash counting. It is exactly what it sounds like. Users take turns posting the next number. They have been at it for nearly a decade. They have reached almost five million. There is a leaderboard. There is a hall of fame. And six of the top counters' usernames had become glitch tokens in GPT.
The explanation is a collision between two separate systems. The tokenizer was trained on a massive web scrape that included Reddit. These users had posted tens of thousands of comments each, nothing but numbers, but their usernames appeared so frequently that the tokenizer gave each one its own dedicated token. Then the model itself was trained on different data, data where these usernames appeared rarely or never. The result was tokens that existed in the vocabulary, had their own slot in the embedding table, but had never been meaningfully trained. Their embeddings were essentially random noise. When the model encountered them, it had no learned associations to fall back on and produced chaotic output.
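The mechanism fits in a few lines. This is a heavily stylized sketch, not real training: embeddings start as random noise, and only the rows of tokens that actually occur in the training text ever get updated. A token in the vocabulary but absent from the training data keeps its random initialization forever.

```python
import random

random.seed(0)
DIM = 8

def init_row():
    """Every vocabulary slot starts as small random noise."""
    return [random.gauss(0.0, 0.02) for _ in range(DIM)]

embeddings = {"the": init_row(), "hello": init_row(), "SolidGoldMagikarp": init_row()}
initial = {tok: row[:] for tok, row in embeddings.items()}

# Stylized training: each occurrence of a token nudges its row.
# "SolidGoldMagikarp" never occurs, so its row is never touched.
occurrences = {"the": 1000, "hello": 50, "SolidGoldMagikarp": 0}
for tok, count in occurrences.items():
    for _ in range(count):
        embeddings[tok] = [x + 0.001 for x in embeddings[tok]]

# The glitch token's row is still exactly its random initialization:
# noise in the embedding table, with a valid vocabulary id attached.
```

When the model later looks up that row, it gets noise with no learned associations, and the output degrades accordingly.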
A number theorist and an AI safety researcher discovered that people who count to infinity on Reddit had accidentally broken the world's most famous chatbot. The story tells you something important about how these systems are built. The tokenizer and the model are separate. They are trained on different data, at different times, with different goals. The tokenizer optimizes for compression. The model optimizes for prediction. And when those two systems disagree about what matters, strange things happen in the gap. Rumbelow went on to found Leap Laboratories, a startup focused on AI interpretability. Watkins went back to studying prime numbers. The glitch tokens were quietly patched out in later model versions.
In the main episode, we mentioned the multilingual tokenization tax. The deep dive version has numbers that are harder to wave away.
A research team led by Aleksandar Petrov evaluated seventeen different tokenizers and found that the same sentence, same meaning, same information, can require up to fifteen times as many tokens depending on the language. They used the FLORES-200 parallel corpus, two thousand sentences translated professionally into two hundred languages, so the comparison is apples to apples. Same content. Different cost.
The mechanism is straightforward. BPE merges the most frequent pairs first. English dominates the training data, so English character patterns get merged early and aggressively. Common English words become single tokens. Common Hindi words, which use a completely different script and appeared far less often in the training data, stay fragmented. The word for "hello" in English is one token. In Hindi, "namaste" is four. That four-to-one ratio compounds across everything: every API call, every context window, every dollar spent.
A twenty twenty-five study using an African languages benchmark found that doubling the token count for a given piece of text quadrupled the training cost and time. Token fertility, the number of tokens per word, reliably predicted model accuracy. Higher fertility meant lower accuracy. The tokenizer was not just making non-English languages more expensive. It was making the model worse at them.
Consider the economics at scale. An Indian company processing one hundred million words per month through GPT-4o pays roughly four hundred and seventy-three dollars, compared to two hundred and ninety dollars for an equivalent American company processing the same volume in English. That is a two thousand one hundred and ninety-six dollar annual difference. Not ruinous for one company, but multiply it across India's five hundred and forty-four million monthly ChatGPT visits, the second-largest market after the United States, and the structural disadvantage becomes significant. Indonesia, the fifth-largest market at two hundred and sixteen million visits, faces similar ratios.
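The arithmetic behind those figures, using the monthly bills stated above as inputs:

```python
# Monthly API cost for the same volume of content, per the figures above.
hindi_monthly = 473    # dollars, Hindi-language workload
english_monthly = 290  # dollars, English-language workload

monthly_gap = hindi_monthly - english_monthly  # 183 dollars per month
annual_gap = monthly_gap * 12                  # 2196 dollars per year
```

One hundred and eighty-three dollars a month, two thousand one hundred and ninety-six a year, purely from token counts.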
The hopeful part: it is getting better. GPT-4o's expanded vocabulary of two hundred thousand tokens was specifically designed to address this. Malayalam efficiency improved by a factor of four. Kannada token counts dropped by seventy-nine percent. Hindi has improved seventy-one percent since twenty twenty-one. But Kashmiri improved by only thirty-eight percent, and Manipuri showed no improvement at all. The fix is uneven, and even the improved languages still pay more than English.
There is a deeper question underneath the economics. When the tokenizer fragments a Hindi sentence into many small pieces, the model has to use some of its attention capacity just to reassemble those fragments into meaningful units. Attention spent on reassembly is attention not spent on understanding. The model is literally doing extra work before it can even begin to think about what you said. The tax is not just financial. It is cognitive.
Karpathy's list of tokenization failures from his twenty twenty-four lecture deserves deeper examination, because each failure reveals something about the boundary between what the model perceives and what it does not.
The spelling problem is not just an amusing quirk. Researchers have found that large language models are "not inherently character-aware." When a model spells a word correctly, it is not extracting characters from the word the way you would. It is performing a learned retrieval task, matching the token to a memorized spelling pattern from training data. This works for common words and fails for uncommon ones. The model can spell "hello" because it saw "h-e-l-l-o" thousands of times in its training data. It struggles with "xylophone" because the token chunks do not align with individual letters and the spelling was not drilled as often.
Case sensitivity creates surprising asymmetries. The word "hello" in lowercase is one token. "Hello" with a capital H is a different token entirely, a different number in the vocabulary with a different embedding. "HELLO" in all capitals breaks into three tokens. All-caps text is literally harder for the model to process than lowercase, because the tokens are shorter and more numerous. The model must spend more attention steps to reconstruct the same word.
The whitespace problem is equally subtle. The word "world" at the start of a sentence is a different token than "world" in the middle of a sentence, because the leading space becomes part of the token. The string " world," with a space before it, is one token. The string "world" alone is a different token. Without a workaround called "add dummy prefix," the model would see fundamentally different representations of the same word depending on where it appears. The same word, the same meaning, the same six letters, represented differently to the model because of an invisible space character.
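Both the case and the whitespace asymmetries can be shown with a toy vocabulary. This is a sketch under two stated simplifications: the vocabulary is invented, and greedy longest-match encoding stands in for real BPE merge application. The point it demonstrates is real, though: the leading space is part of the token, so " world" and "world" get different ids.

```python
# Toy vocabulary: leading spaces are baked into tokens, as in GPT-style BPE.
vocab = {"world": 1, " world": 2, "Hello": 3, ",": 4}

def greedy_encode(text, vocab):
    """Longest-match greedy encoding against the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

# "world" at the start of input vs after a space: different token ids
# for the same six letters, because of the invisible leading space.
```

Note also that "Hello" and a hypothetical lowercase "hello" would be separate vocabulary entries, which is the case-sensitivity asymmetry in miniature.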
And then there are emoji. A simple face emoji requires four bytes in UTF-8 encoding. A family emoji, the one showing two parents and two children, is actually multiple Unicode characters joined by invisible zero-width joiner characters. It can tokenize into a dozen or more tokens. Some models generate emoji byte by byte, which means the model is producing a sequence of tokens that have no meaning individually, tokens that only become an emoji when decoded back to text. The model is assembling a picture one random-looking byte at a time, hoping the sequence adds up to a smiling face.
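The byte counts here are plain Unicode facts, checkable in a few lines of Python:

```python
face = "\U0001F600"  # a simple smiling face emoji

# One code point, but four bytes once encoded as UTF-8.
face_bytes = face.encode("utf-8")

# The family emoji: four person emoji joined by three invisible
# zero-width joiner characters (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
family_bytes = family.encode("utf-8")
# Seven code points; 4 + 3 + 4 + 3 + 4 + 3 + 4 = 25 bytes of UTF-8.
```

A byte-level tokenizer has to emit all twenty-five of those bytes, in order, for the picture to come out right.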
Karpathy's wish to "delete this stage entirely" is not idle dreaming. Multiple research groups are trying to build language models that operate directly on raw bytes, bypassing tokenization altogether.
The most prominent effort is Meta's Byte Latent Transformer, published in twenty twenty-four. Instead of pre-tokenizing text into fixed pieces, it dynamically groups bytes into patches based on the information content of each byte. Simple, predictable sequences get long patches. Complex, information-dense sequences get short patches. The average patch size is about four bytes, comparable to subword tokens, but the boundaries are adaptive rather than fixed. On benchmarks, it matches the performance of standard transformer models trained on the same data.
SpaceByte takes a different approach, using bytes that typically mark word boundaries, such as spaces, to decide where its larger, more expensive transformer blocks should run. It significantly outperforms standard byte-level transformers and matches subword transformer performance on several benchmarks. ByteFlow, published in twenty twenty-six, uses neural networks to determine where to place chunk boundaries dynamically, letting the model itself learn where the meaningful breaks are.
All of these approaches converge on the same architectural insight: you need a hierarchy. Process bytes at a low level, then build higher-level representations from them. The question everyone is trying to answer is where to draw the line between levels. Do you let information theory decide, the way the Byte Latent Transformer does? Do you let the model learn, the way ByteFlow does? Do you use linguistic heuristics like whitespace?
The obstacle is raw computation. Byte-level sequences are four to six times longer than token sequences. The attention mechanism in a standard transformer is quadratic in sequence length, meaning a six times longer sequence costs roughly thirty-six times more compute. All the tokenizer-free approaches have to find a way around this, either by modifying attention itself, by processing bytes in chunks, or by being clever about which bytes attend to which other bytes. None of them have fully solved the problem at the scale of the largest commercial models.
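The thirty-six times figure is just the quadratic cost worked out, with illustrative round numbers:

```python
def attention_cost(seq_len):
    """Standard self-attention compares every position with every other: O(n^2)."""
    return seq_len * seq_len

tokens = 1_000   # a passage as roughly 1,000 subword tokens
raw_bytes = 6_000  # the same passage as roughly 6,000 bytes, six times longer

ratio = attention_cost(raw_bytes) / attention_cost(tokens)
# Six times the sequence length, thirty-six times the attention compute.
```

That ratio is what every tokenizer-free architecture is engineering around.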
So tokenization persists. Not because it is good, but because it is good enough and the alternatives are not yet cheap enough. Philip Gage's nineteen ninety-four compression trick, repurposed by Rico Sennrich's twenty fifteen linguistic insight, remains the foundation. Thirty-two years and counting.
This episode's term: vocabulary.
In everyday English, your vocabulary is the set of words you know. In AI, the vocabulary is the fixed set of token pieces the model can recognize. That is it. A numbered list of fragments. Everything you type gets broken down until every piece matches something in that list. If a word is common enough, it is a single entry in the vocabulary. If it is rare, it gets split into smaller entries. Nothing can exist outside the vocabulary. There is no "I do not know this word" in the token world, only "I will split this into pieces I do recognize."
When a company says their model has a two hundred thousand token vocabulary, marketing wants you to think: bigger is better, more capable, more languages. What it actually means is that the tokenizer was trained with two hundred thousand merge operations instead of fifty thousand. More merges means more common patterns get their own dedicated token, which means shorter sequences for the same text, which means the model processes input faster and fits more text into its context window. But it also means a larger embedding table, which takes more memory and more compute. There are diminishing returns. The jump from thirty thousand to fifty thousand tokens makes a real difference. The jump from one hundred thousand to one hundred and fifty thousand is barely noticeable. Two hundred thousand is not four times better than fifty thousand. It is maybe twenty percent more efficient for the languages that benefited from the expanded merges, and identical for English.
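The embedding table cost scales linearly with vocabulary size, and the sizes involved are easy to estimate. A sketch with an assumed model width of 4096 and 16-bit parameters; real widths vary by model and are not public for most commercial systems.

```python
def embedding_table_bytes(vocab_size, d_model, bytes_per_param=2):
    """Memory for the embedding table alone, at 16-bit precision."""
    return vocab_size * d_model * bytes_per_param

d_model = 4096  # illustrative assumption, not any specific model's width

small = embedding_table_bytes(50_257, d_model)   # GPT-2-sized vocabulary
large = embedding_table_bytes(200_000, d_model)  # GPT-4o-sized vocabulary
# Quadrupling the vocabulary roughly quadruples this one table,
# which is memory the model pays for whether or not a token is ever used.
```

That per-slot cost is also why untrained vocabulary slots, like the glitch tokens earlier, are not free.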
That was the deep dive for episode one.