Actually, AI
Tokens: The Pieces You Never See
S1 E1 · 11m · Apr 03, 2026
Modern AI models fail at simple tasks like counting the R's in "strawberry" because they never actually see your words: a 1994 compression algorithm chops your text into invisible fragments called tokens before the neural network even gets to work.


The Strawberry Problem

This is episode one of Actually, AI.

Ask a modern AI how many R's are in the word strawberry, and there is a good chance it will tell you two. The answer is three. This is not a hard question. A five-year-old can do it. The most sophisticated language models ever built, trained on trillions of words at a cost of hundreds of millions of dollars, cannot reliably count the letters in a piece of fruit. And the reason is not some deep flaw in the architecture, not some philosophical limitation of machine intelligence. The reason is that the AI never saw the letters in the first place.

You type a sentence into ChatGPT or Claude or Gemini, and you assume the model reads your words. It does not. Before your message reaches the neural network, before any of the intelligence happens, a separate system chops your text into fragments. Not words. Not syllables. Not any unit a human would recognize. Fragments that were chosen by a statistical compression algorithm, optimized for efficiency, with zero regard for meaning. These fragments are called tokens. And they shape everything that follows.

The word strawberry does not enter the model as S, T, R, A, W, B, E, R, R, Y. It enters as something like "str," "aw," "berry." Three chunks. The model can see that there are three chunks. It can process relationships between those chunks. But it cannot see inside them, any more than you can read the individual atoms in a printed letter. When you ask about the R's, the model is trying to answer a question about information it simply does not have access to. It is like asking someone to count the nails in a house by looking at a photograph.
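You can watch this happen yourself. The sketch below uses tiktoken, OpenAI's open-source tokenizer library. The exact fragments depend on which encoding you load, so treat the comment as one plausible outcome, not a guarantee.

```python
# Inspect how a real tokenizer fragments a word.
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("strawberry")
fragments = [enc.decode([tid]) for tid in token_ids]

print(fragments)  # something like ['str', 'aw', 'berry'] -- splits vary by encoding
```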

A Compression Trick from Colorado

The story of how AI came to read this way starts in nineteen ninety-four, in Colorado Springs, with a software engineer named Philip Gage. Gage was not working on artificial intelligence. He was not working on language. He was working on data compression, the old-fashioned kind, making files smaller so they would fit on floppy disks and transfer faster over slow modems.

Gage published a short article in C Users Journal, a niche magazine for working C programmers. Circulation was under ten thousand. The article described a simple algorithm he called Byte Pair Encoding. The idea was elegant. Take a sequence of data. Find the two adjacent bytes that appear together most often. Replace every occurrence of that pair with a single new symbol. Record what you did. Repeat. Each round, the most common pair gets compressed into one piece. After enough rounds, common patterns have been collapsed into short representations and the data is smaller.
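For the curious, here is a minimal sketch of the idea in Python rather than in Gage's original C. It is a toy, not a faithful port: find the most frequent adjacent pair, swap in a fresh placeholder symbol, record the substitution, repeat.

```python
from collections import Counter

def byte_pair_encode(data: list[str], num_merges: int):
    """Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent pair."""
    merges = {}          # placeholder -> the pair it replaced
    next_symbol = 0
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break        # nothing repeats; no compression left
        placeholder = f"<{next_symbol}>"
        next_symbol += 1
        merges[placeholder] = pair
        # Replace every occurrence of the pair with the placeholder.
        out, i = [], 0
        while i < len(data):
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(placeholder)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return data, merges

compressed, table = byte_pair_encode(list("aaabdaaabac"), num_merges=3)
print(compressed, table)  # a shorter sequence, plus the substitution table
```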

Gage compared his method to the established Lempel-Ziv-Welch algorithm and found it provided almost as much compression with much simpler code. He published the algorithm, the C source code, and moved on with his career. And then he essentially vanished. There is a thread on Hacker News from twenty twenty-five titled "What Became of Philip Gage?" Nobody found an answer. No interviews, no social media, no conference appearances. The man who accidentally invented how every AI on Earth reads text is, as far as the public record is concerned, a ghost.

Twenty-one years later, in twenty fifteen, a computational linguist named Rico Sennrich was working at the University of Edinburgh on a problem that had nothing to do with compression. Neural machine translation systems had a fatal weakness. They worked from fixed vocabularies, and any word not in that vocabulary was replaced with a generic unknown token. This was catastrophic for rare words, for names, for morphologically rich languages like German and Turkish where a single concept might become a compound word the model had never seen before.

Sennrich had an insight that now looks obvious and was anything but. He took Gage's compression algorithm and changed what it operated on. Instead of merging frequent byte pairs to make data smaller, he merged frequent character pairs to build a vocabulary of word fragments. The algorithm was identical. The interpretation was different. Common words like "the" and "and" stayed whole, because their character sequences merged early. Rare words got split into recognizable pieces. A German compound the model had never seen before could be understood through its components. The problem of unknown words simply disappeared.
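Sketched in code, the twist is small. Assume a merge list has already been learned from a large corpus, in order; segmenting any new word just replays those merges over its characters. The merges below are hand-picked for illustration, not taken from any real tokenizer.

```python
def segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply learned BPE merges, in training order, to split a word into subwords."""
    pieces = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        pieces = out
    return pieces

# Hypothetical merges a corpus might teach us, in the order they were learned.
merges = [("t", "h"), ("th", "e"), ("s", "t"), ("st", "r"), ("a", "w"),
          ("e", "r"), ("b", "er"), ("ber", "r"), ("berr", "y")]
print(segment("the", merges))         # ['the'] -- common word stays whole
print(segment("strawberry", merges))  # ['str', 'aw', 'berry'] -- rare word splits
```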

Sennrich, Barry Haddow, and Alexandra Birch published the paper at the Association for Computational Linguistics conference in twenty sixteen. It became one of the most cited natural language processing papers of the decade. When OpenAI built GPT-2 in twenty nineteen, they used a variant of Sennrich's method. When they built GPT-3, they used it again. When they built GPT-4, they used it again. So did Google for their models. So did Meta. So did Anthropic. A compression trick from a dead magazine, repurposed by a linguist in Edinburgh, became the universal foundation of how artificial intelligence reads.

The Machine That Cannot Spell

So what does this mean in practice? When you type a message to an AI, your text passes through a tokenizer before anything else happens. The tokenizer holds a fixed vocabulary, typically between fifty thousand and two hundred thousand token pieces. It scans your text from left to right, matching the longest piece it can find in its vocabulary, then moves on to the next segment. Common English words are usually single tokens. Less common words get split. Rare words, technical terms, and names from other languages get fragmented further.
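A stripped-down version of that scan looks like the sketch below. Real BPE tokenizers replay learned merge rules rather than doing a literal longest match, but the effect is close enough for intuition; the tiny vocabulary here is invented.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy left-to-right longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

vocab = {"straw", "berry", "str", "aw", "the", " ", "count", "ing"}
print(tokenize("counting the strawberry", vocab))
# ['count', 'ing', ' ', 'the', ' ', 'straw', 'berry']
```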

The consequences ripple through everything the model does. Andrej Karpathy, the former director of AI at Tesla who co-founded OpenAI, published a two-hour lecture on tokenization in twenty twenty-four. He opened with a list.

Why can't the language model spell words? Tokenization. Why can't the language model do string processing like reversing a string? Tokenization. Why is it bad at non-English languages like Japanese? Tokenization. Why is it bad at simple arithmetic? Tokenization.

The arithmetic problem is particularly revealing. The number three hundred and eighty enters the model as a single token. Three hundred and eighty-one enters as two tokens, the digits three and eight as one piece and the digit one as the other. Three hundred and eighty-three goes back to one token. The model must learn to perform addition differently depending on how the specific numbers happen to be split by the tokenizer. Some additions are between single tokens that each represent a complete number. Others require the model to first mentally reassemble multi-digit numbers from fragments, and then compute. The difficulty of basic math depends not on the math itself but on the accidents of tokenization.
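You can check the lottery yourself, again assuming tiktoken, here with the older GPT-2 encoding, since which numbers stay whole varies from encoding to encoding.

```python
import tiktoken

# The GPT-2 era encoding; newer encodings split numbers differently.
enc = tiktoken.get_encoding("gpt2")
for number in ["380", "381", "382", "383"]:
    pieces = [enc.decode([tid]) for tid in enc.encode(number)]
    print(number, "->", pieces)  # some numbers are one token, others fragment
```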

Karpathy ended his lecture with a wish.

Someone out there ideally finds a way to delete this stage entirely.

He was half joking. The problem is that operating on raw bytes instead of tokens would make input sequences four to six times longer. And the computational cost of the attention mechanism, the part of the AI that figures out which parts of the input matter for each part of the output, scales with the square of the sequence length. Four to six times longer inputs mean roughly sixteen to thirty-six times more computation. Nobody has figured out how to make that practical at scale. Tokenization is a compromise. It is not the solution anyone would design from scratch, knowing what we know now. But it works well enough, and the alternatives are still too expensive.
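The back-of-the-envelope arithmetic, as a sketch:

```python
# Attention cost grows with the square of sequence length, so stretching the
# input by a factor k multiplies the work by roughly k squared.
for k in (4, 5, 6):  # plausible bytes-per-token expansion factors
    print(f"{k}x longer input -> ~{k * k}x more attention compute")
```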

The Language Tax

There is a darker consequence that goes beyond quirky spelling failures. The tokenizer was trained on internet text, and the internet is dominated by English. English patterns were the most common, so they got merged first, which means English gets the shortest, most efficient token sequences. Every other language pays a tax.

The word "hello" in English is one token. The Arabic greeting "marhaba" is two tokens. The Hindi greeting "namaste" is four. Same concept, same function, different cost. A researcher named Aleksandar Petrov and his colleagues found that the same sentence can require up to fifteen times as many tokens in some languages compared to English. Fifteen times. Same information, different price.

This matters because every major AI company charges by the token. More tokens for the same content means higher costs. It also means shorter effective context windows. If you are writing in English, a one hundred and twenty-eight thousand token context window holds roughly one hundred and ten thousand words. If you are writing in Hindi, that same window holds about sixty-eight thousand words. Thirty-eight percent less space for your thoughts, because the tokenizer was trained on data where Hindi was underrepresented.

India is ChatGPT's second-largest market. Indonesian users are the fifth-largest group. They are all paying more, getting less context, and receiving lower-quality responses, because of a compression algorithm that was optimized on English text. This is not anyone's deliberate choice. It is a structural consequence of how Byte Pair Encoding works on unevenly distributed data. But the economic impact is real, and fixing it is slow. GPT-4o's expanded vocabulary of two hundred thousand tokens improved efficiency for Malayalam by a factor of four and reduced Hindi costs substantially. But Hindi speakers still pay sixty-three percent more than English speakers for equivalent work.

Where It All Starts

Every other episode in this series operates on tokens. When we talk about how AI understands meaning in episode six, we will be talking about how tokens get transformed into mathematical representations of meaning. When we talk about context windows in episode ten, we will be talking about how many tokens the model can hold in its working memory at once. When we talk about inference costs, about hallucination, about why the same prompt gives different answers depending on how you phrase it, we will be talking about the downstream effects of this first, invisible step.

The AI does not read your words. It reads fragments that a compression algorithm from nineteen ninety-four decided were statistically convenient. Everything that follows, every spark of apparent intelligence, every eerily human response, every confident mistake, is built on those pieces. The pieces you never see.

That was episode one of Actually, AI. The deep dive companion goes further into the algorithm itself, into glitch tokens, into the people who accidentally broke ChatGPT by counting to infinity on Reddit, and into the researchers trying to make tokenization obsolete. Find it right after this in your feed.