Actually, AI
Embeddings: Meaning Is Geometry
S1 E6 · 10m · Apr 04, 2026
When you search "best restaurant near me" and get results about "top dining spots," the search engine isn't matching keywords. It's measuring the distance between invisible points in a space with hundreds of dimensions, and those points are called embeddings.


This is episode six of Actually, AI.

The Question Nobody Asks

You type "What is the best restaurant near me" into a search engine, and it returns results about "top dining spots in your area." You did not say "dining." You did not say "top." You did not say "spots." And yet the search engine knew what you meant. Not because it understands English. Not because it has a thesaurus. Because somewhere in the machine, your words and those words occupy nearby positions in a space you cannot see.

Here is what most people assume: the computer matches keywords. You typed "restaurant," it found pages with "restaurant." If you are a little more sophisticated, you might think it uses synonyms, some kind of lookup table where "restaurant" maps to "dining" and "eatery" and "cafe." That is closer, but still wrong. What actually happens is stranger and more elegant. The computer converts your words into points in a geometric space with hundreds of dimensions, and then it measures the distance between those points. "Restaurant" and "dining" are not linked by a rule. They are near each other because they appeared in similar contexts across billions of sentences. Meaning, it turns out, has a shape. And that shape is called an embedding.

The idea sounds abstract until you see the most famous example. Take the point in space that represents "king." Subtract the point that represents "man." Add the point that represents "woman." The nearest point to the result is "queen." That is not a metaphor. That is actual arithmetic, performed on lists of numbers, producing the correct answer. The relationship between "man" and "woman" is a direction in this space, and that same direction connects "king" to "queen," "brother" to "sister," "uncle" to "aunt." Gender is not a label. It is a direction you can travel.

A Czech Researcher and a Silly Idea

Tomas Mikolov did not set out to discover the geometry of meaning. He was a speech recognition researcher from the Czech Republic, born in the small city of Sumperk, who had come to Google in twenty twelve with a specific dream: he wanted to see Google Translate rebuilt around neural networks. He had been working on neural language models since his PhD at Brno University of Technology, and he had a stubborn conviction that simpler models, trained on more data, would outperform the complicated architectures his colleagues preferred.

In twenty thirteen, Mikolov and his team at Google published a model called Word2Vec. The idea was almost embarrassingly simple. Take a huge pile of text. For each word, look at the words around it. Train a small neural network to predict a word from its neighbors, or to predict the neighbors from the word. That is the entire algorithm. No parsing, no grammar rules, no hand-built knowledge. Just prediction, billions of times, on billions of words.

What came out was not a better language model. What came out was a map. Every word in the vocabulary had been assigned a position in a space of three hundred dimensions, and those positions captured something that looked remarkably like meaning. Words that meant similar things clustered together. Words that shared relationships pointed in the same direction. And that is when someone suggested the test that would make Word2Vec famous.

As Mikolov later told it: "He asked me what the closest vector would be after subtracting 'man' from 'king' and adding 'woman.' I told him this was a rather silly idea and that nothing sensible could come out of this."

They went to a computer and tried it anyway. The answer was "queen." Mikolov's colleague started playing around, trying past tenses of verbs and plurals. Everything worked. The model had never been told that kings and queens were related, that verbs have tenses, that countries have capitals. It figured out those relationships on its own, just from the statistics of which words appear near which other words.

The insight underneath was not new. A British linguist named John Rupert Firth wrote in nineteen fifty-seven that "you shall know a word by the company it keeps." That single sentence is the theoretical foundation of every embedding ever computed. Words that appear in similar contexts have similar meanings. Mikolov did not invent this idea. He built a machine that proved it was true, at scale, and that the proof took the form of geometry.

Points in Space

Here is the mechanism, as simply as it can be stated without becoming wrong.

Every word, or more precisely every token, the unit from episode one, gets converted into a list of numbers. In Word2Vec, that list was three hundred numbers long. In a modern model like GPT-4, it is likely closer to sixteen thousand. Each number represents a position along one axis of a space with that many dimensions. You cannot visualize sixteen thousand dimensions. Nobody can. But the math does not care whether you can visualize it.

Think of it this way. Imagine you could describe every word with just two numbers: how positive or negative it is, and how active or passive it is. "Happy" might be positive five, active three. "Sad" might be negative four, passive two. "Furious" might be negative three, active five. Plot those on a grid and you get a two-dimensional map of emotional words. Now imagine instead of two dimensions, you have thousands. Each dimension captures some axis of meaning that the model discovered during training. Not axes a human chose, not axes anyone can necessarily name, but axes along which words differ from each other in consistent, measurable ways.
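If you want to see that toy example on the page, here is a minimal sketch in Python. The two axes and every coordinate are invented for illustration; real embeddings are learned from data, never assigned by hand.

```python
import math

# Toy two-dimensional "embeddings": (positivity, activity).
# Coordinates are invented for illustration; real models learn them.
words = {
    "happy":   (5.0, 3.0),    # positive five, active three
    "sad":     (-4.0, -2.0),  # negative four, passive two
    "furious": (-3.0, 5.0),   # negative three, active five
}

def distance(a, b):
    """Straight-line distance between two points on the grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(words["happy"], words["sad"]))      # far apart: opposite mood
print(distance(words["happy"], words["furious"]))  # also far, along a different axis
print(distance(words["sad"], words["furious"]))    # closest pair: both negative
```

Swap two dimensions for a few thousand and you have the real thing.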

The analogy works up to this point, and then it breaks. Where it breaks is important. In our two-dimensional example, the axes had clear labels: positivity, activity. In a real embedding space, most dimensions do not have clean labels. Researchers have found dimensions that correlate with gender, formality, or time period, but the space is not organized like a spreadsheet with tidy columns. It is a compressed tangle of statistical patterns where many concepts overlap across many dimensions. The geometry is real. The labels are mostly not.

What is remarkable is what this geometry makes possible. Similar words are nearby: "dog" and "puppy" and "hound" cluster together. Relationships are directions: the vector from "Paris" to "France" points in nearly the same direction as the vector from "Berlin" to "Germany." Grammar is geometry too: the offset from "walk" to "walking" is nearly the same as the offset from "swim" to "swimming." The model has no concept of grammar, no concept of capitals, no concept of anything. It has positions, distances, and directions. And those turn out to be enough.
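Here is a sketch of the king-minus-man-plus-woman arithmetic with tiny hand-made vectors, so the trick is visible at a glance. The two axes are invented; with real learned vectors (gensim's pretrained Word2Vec vectors, say) the code looks the same, just with three hundred opaque dimensions instead of two labeled ones.

```python
import numpy as np

# Hand-made vectors on two invented axes: (royalty, gender).
vocab = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "throne": np.array([1.0,  0.0]),
    "apple":  np.array([-1.0, 0.3]),
}

def cosine(a, b):
    """Similarity of direction: 1.0 means pointing the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman = [1, 1] - [0, 1] + [0, -1] = [1, -1]
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Find the nearest word to the result, excluding the three we started from.
candidates = {w: v for w, v in vocab.items() if w not in ("king", "man", "woman")}
print(max(candidates, key=lambda w: cosine(target, candidates[w])))  # queen
```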

The Invisible Infrastructure

So where does this show up in your life? Everywhere, and you never see it.

When you search the web and the engine returns relevant results despite your imprecise wording, that is embeddings. Your query and the documents have been converted to points in the same space, and the engine returned the nearest points. When a streaming service recommends a song you have never heard but somehow love, that is embeddings. The song and your listening history live near each other in a space of musical features. When a chatbot like the one powering this series retrieves relevant information from a database before answering your question, that is embeddings. The question and the stored text are compared as geometric points, and the closest matches get fed to the model as context.

This technique has a name: retrieval-augmented generation, usually shortened to RAG. The entire pipeline rests on the simple idea that if you convert both the question and the documents into the same geometric space, then finding relevant information becomes finding nearby points. An industry of vector databases, with names like Pinecone, Weaviate, and Milvus, exists to make this geometric search fast at scale. A market worth over a billion dollars in twenty twenty-four, built on the insight that meaning has coordinates.
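In code, the retrieval step is shorter than the explanation. This sketch assumes the sentence-transformers library and its all-MiniLM-L6-v2 model as the embedding model; they are stand-ins, since any embedding model and any vector database follow the same pattern of embed, compare, keep the nearest.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed installed

# Any embedding model works here; this one is small and commonly used.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The support team is available Monday through Friday, 9 to 5.",
    "Shipping to Europe usually takes five to ten business days.",
]
query = "How long do I have to send something back?"

# Convert documents and query into points in the same space.
# Normalizing makes the dot product equal cosine similarity.
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# Retrieval is just: which stored point is nearest to the query point?
scores = doc_vecs @ query_vec
print(documents[int(np.argmax(scores))])  # expected: the refund-policy sentence
```

Notice that the query and the winning document share almost no keywords. The match happens in the geometry, not in the words.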

But embeddings are not just infrastructure for search. They are the language every modern AI model speaks internally. When you send a message to a chatbot, the first thing that happens is tokenization, episode one. The second thing is embedding: each token gets looked up in a table that assigns it a position in the model's internal space. From that point forward, the model operates entirely in geometry. Attention, episode four, is a geometric operation on these points. Producing the output is another geometric operation on these points. The entire architecture, from input to output, is a machine for manipulating positions in a space of meaning.
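That second step, the table lookup, is less mysterious than it sounds. Here is a minimal PyTorch sketch, with made-up sizes and token ids (real vocabularies run to tens or hundreds of thousands of entries):

```python
import torch

vocab_size = 50_000  # one row per token in the vocabulary (made-up size)
d_model = 768        # length of each embedding vector (made-up size)

# The embedding layer is literally a table of shape (vocab_size, d_model).
# Its entries start out random and are shaped during training (episode three).
embedding_table = torch.nn.Embedding(vocab_size, d_model)

# Tokenization (episode one) has already turned the message into integer ids.
token_ids = torch.tensor([[314, 1842, 27597, 2465]])  # made-up ids for four tokens

# The embedding step: each id is swapped for its row in the table.
vectors = embedding_table(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 768]): four tokens, each now a point in 768-d space
```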

That is why this episode matters for the series. Tokens are the alphabet. Embeddings are the meaning those letters carry. Everything else operates on that meaning.

What the Map Reveals About the Mapmaker

There is a darker side to meaning as geometry. In twenty sixteen, a team led by Tolga Bolukbasi at Boston University trained embeddings on Google News articles and then ran the analogy test. "Man is to computer programmer as woman is to..." The answer was "homemaker." "Man is to doctor as woman is to..." The answer was "nurse." The embedding space, trained on millions of news articles written by professional journalists, had encoded the biases of the society that produced the text.

The geometry was precise. The researchers found that gender bias was not scattered randomly through the space. It was organized along a single direction, a gender axis that you could mathematically identify by comparing pairs like "he" and "she," "man" and "woman," "king" and "queen." Every word in the vocabulary had a position along this axis, and those positions reflected stereotypes: "receptionist" leaned female, "surgeon" leaned male. The bias was not noise. It was structure.
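Here is a simplified sketch of how a direction like that can be found. The vectors below are toys invented for illustration, and Bolukbasi's team used a more careful construction (principal components over many gendered word pairs), but the core move is the same: average the differences between paired words, then project everything else onto the resulting axis.

```python
import numpy as np

# Toy three-dimensional vectors, invented for illustration.
# The real analysis used 300-dimensional embeddings trained on Google News.
emb = {
    "he":           np.array([ 0.9, 0.1, 0.2]),
    "she":          np.array([-0.9, 0.1, 0.2]),
    "man":          np.array([ 0.8, 0.3, 0.1]),
    "woman":        np.array([-0.8, 0.3, 0.1]),
    "surgeon":      np.array([ 0.4, 0.7, 0.5]),
    "receptionist": np.array([-0.5, 0.6, 0.4]),
    "table":        np.array([ 0.0, 0.2, 0.9]),
}

# Estimate the gender axis as the average difference of known pairs.
pairs = [("he", "she"), ("man", "woman")]
gender_axis = np.mean([emb[a] - emb[b] for a, b in pairs], axis=0)
gender_axis /= np.linalg.norm(gender_axis)

# Project other words onto the axis: positive leans "he", negative leans "she".
for word in ("surgeon", "receptionist", "table"):
    print(f"{word:>12}: {float(emb[word] @ gender_axis):+.2f}")
```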

This should not have been surprising, but it was. The training data was the world, reflected in text, and the world is biased. The model learned the shape of that bias with the same fidelity it learned that Paris is the capital of France. But the discovery reframed embeddings. They were not just useful representations of meaning. They were mirrors of the culture that produced the training data, precise enough to be used as a measuring instrument. Researchers later trained embeddings on text from each decade of the twentieth century and tracked how stereotypes shifted over time. The word "hysterical" dropped from the top five words associated with women in the nineteen twenties to outside the top hundred by nineteen ninety. The model was not just reflecting bias. It was quantifying social change.

The Thread

Here is where embeddings sit in the map of this series. In episode one, you learned that AI does not read words. It reads tokens, fragments of text that do not always align with human intuition. Now you know what happens next. Each token becomes a point in a geometric space, a list of thousands of numbers that places that token in relation to every other token. The meaning of a word is not stored in a definition. It is stored in a position.

Attention, from episode four, operates on these positions. When the model figures out that "it" in "the cat sat on the mat because it was tired" refers to "the cat" and not "the mat," it is computing relationships between the geometric points that represent those tokens. Training, from episode three, is what shapes the space in the first place, slowly adjusting the positions of every token through billions of examples until similar meanings are near each other and relationships become directions.

And if you want to know where the analogy breaks down, where the king minus man plus woman equation stops working, what happens when the same word needs different positions in different sentences, and how an entire industry of vector databases was built on top of these ideas, the deep dive is waiting for you right after this in your feed.

That was episode six.