Actually, AI
Embeddings Deep Dive: Inside the Geometry of Meaning
30m · Apr 04, 2026
Word2Vec has two opposite architectures: one predicts a missing word from its context, the other predicts the context from a single word. That split is the thread running through how your chatbot searches documents, why Netflix recommends videos you didn't know existed, and what "vector" actually means.

This is the deep dive companion to episode six of Actually, AI: embeddings.

The Two Architectures Nobody Remembers

In the main episode, we covered how Word2Vec converts words into positions in space by predicting words from their neighbors. That description was accurate but deliberately vague. There are actually two versions of Word2Vec, and they work in opposite directions. The first is called continuous bag of words. You take a window of context, say four words on either side of a gap, truncated at the sentence boundary, and you train the network to predict the missing word. If the words around the gap are "the," "quick," "brown" on one side and "jumps," "over," "the," "lazy" on the other, the model learns to predict "fox" in the middle. It sees the neighborhood and guesses the resident.

The second version is called skip-gram, and it flips the problem entirely. You give the model one word, "fox," and ask it to predict the surrounding context. Which words are likely to appear near "fox"? The model has to learn that "quick" and "brown" and "jumps" are reasonable neighbors but "quantum" and "fiscal" are not. One word guessing its neighborhood. The same training data, the same embeddings at the end, but a fundamentally different question being asked during learning.
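
Here is a minimal Python sketch of how the two architectures slice the same sentence into training examples. The window size and whitespace tokenization are illustrative simplifications, not Word2Vec's exact defaults.

```python
# Sketch: how CBOW and skip-gram slice the same text into training examples.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # words considered on each side of the target

cbow_examples, skipgram_examples = [], []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: many context words in, one target word out
    cbow_examples.append((context, target))
    # Skip-gram: one target word in, one context word out (one pair per neighbor)
    skipgram_examples.extend((target, c) for c in context)

print(cbow_examples[3])       # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```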

Here is the part that surprised the researchers. Skip-gram, the version that seems harder because it has to predict many words from one, turned out to produce better representations for rare words. If "aardvark" appears only twelve times in the training corpus, continuous bag of words barely learns anything useful because it rarely needs to predict "aardvark." But skip-gram sees "aardvark" twelve times as an input and has to predict its context each time. Twelve training signals instead of twelve needles in a haystack. For common words, continuous bag of words was faster and roughly equivalent. For rare words, skip-gram was the clear winner. Mikolov and his team offered both, but skip-gram became the default that most people used.

But neither version could have worked at scale without a trick called negative sampling, and this trick is where the real engineering genius lives. The naive way to train Word2Vec would require computing a probability distribution over the entire vocabulary for every single training example. If your vocabulary has three hundred thousand words, that means three hundred thousand calculations per example, billions of examples, an impossible computational bill. Negative sampling replaces that with a shortcut. Instead of asking "what is the probability of every word in the vocabulary," you ask a simpler question: "is this word pair real or fake?" For each genuine pair you saw in the text, you generate a handful of fake pairs by randomly substituting words from the vocabulary. The model learns to distinguish real context from noise. Ten fake pairs per real one means eleven calculations instead of three hundred thousand. This is what made Word2Vec trainable on a single machine in hours instead of weeks on a cluster. The paper describing this trick, co-authored with Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean, is the one that won the NeurIPS Test of Time Award in twenty twenty-three, a full decade after publication.
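
A sketch of the objective for a single skip-gram pair under negative sampling. The vocabulary size, initialization, and uniform sampling of negatives are simplifications; the real implementation draws negatives from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 300, 10   # small vocab for the sketch; k = negatives per real pair

# Two embedding tables, as in Word2Vec: one for center words, one for context words
center_vecs = rng.normal(0, 0.01, (vocab_size, dim))
context_vecs = rng.normal(0, 0.01, (vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id):
    """Loss for one (center, context) pair: one real pair plus k fake ones.

    That is k + 1 dot products instead of a softmax over the whole vocabulary.
    """
    v = center_vecs[center_id]
    pos = context_vecs[context_id]
    # Uniform negatives for simplicity; Word2Vec samples from unigram^0.75
    neg = context_vecs[rng.integers(0, vocab_size, size=k)]
    loss = -np.log(sigmoid(pos @ v))            # pull the real pair together
    loss -= np.log(sigmoid(-(neg @ v))).sum()   # push the fake pairs apart
    return loss

print(negative_sampling_loss(center_id=42, context_id=7))
```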

The Cherry on the Cherry-Picked Example

The main episode introduced the famous equation: king minus man plus woman equals queen. That is real, and it is remarkable. But this deep dive owes you the full story, which is less clean and more interesting.

The analogy test works like this. You compute the vector arithmetic, king minus man plus woman, and then you search the vocabulary for the nearest point to the result. There is a critical detail that most presentations leave out. The three input words, king, man, and woman, are excluded from the search. If they were not excluded, the answer would usually just be "king." Because "man" and "woman" sit close together in the space, subtracting one and adding the other barely moves the starting point, and the result lands nearer to "king" than to anything else, by a cosine similarity gap of more than zero point one over the next nearest word, "queen." The analogy works, but it works with an asterisk that rarely gets mentioned in lectures or blog posts.
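
A minimal sketch of the analogy query with the exclusion trick, assuming `embeddings` is a dict mapping words to vectors (gensim or any other loader would do):

```python
import numpy as np

def analogy(embeddings, a, b, c, exclude_inputs=True):
    """Return the word nearest to vec(a) - vec(b) + vec(c) by cosine similarity.

    The standard evaluation excludes the three query words; without that
    exclusion, the answer to king - man + woman is usually just "king".
    """
    target = embeddings[a] - embeddings[b] + embeddings[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if exclude_inputs and word in (a, b, c):
            continue
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Usage with real vectors (e.g. loaded via gensim):
# word, sim = analogy(vectors, "king", "man", "woman")  # -> ("queen", ...)
```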

Researchers at the University of Groningen looked more carefully at the full set of analogy tests and found something uncomfortable. The male to female analogies, the ones that always appear in the presentations, are actually an exception. They are among the best-performing category. The test set in the original paper contained nearly nine thousand semantic questions and over ten thousand syntactic questions across fourteen categories. Some categories, like capital cities and verb tenses, performed well. Others, particularly in lexical semantics, performed poorly. A frustratingly high number of analogies, the researchers wrote, only worked when using the trick of not allowing the query word itself as the answer.

The male to female analogies typically given in lectures represent an exception, not the rule. The results are remarkable for some categories and unreliable for others.

None of this means the embeddings are a fraud. The geometry is real. Words that mean similar things genuinely cluster together. Relationships genuinely correspond to directions. The point is subtler than that. The most famous demonstration of embeddings happens to showcase their strongest behavior, and the community has been presenting the highlight reel as if it were the typical result for over a decade. Knowing where the trick works and where it does not is the difference between understanding embeddings and believing in them.

The Geometry of Prejudice

In twenty sixteen, a team led by Tolga Bolukbasi at Boston University published what would become one of the most cited papers in the bias literature. The main episode covered the headline results, man is to computer programmer as woman is to homemaker. The deep dive goes into the geometry of how that bias was found, why it was structured the way it was, and what happened when they tried to fix it.

Bolukbasi and his colleagues started with a specific hypothesis. They trained three hundred dimensional Word2Vec embeddings on Google News articles, roughly three million unique words drawn from professional journalism. Their assumption was that professional news writing would contain fewer stereotypes than, say, internet forums. They were wrong. The biases were precise, consistent, and geometrically organized.

Here is how they mapped it. They took ten pairs of words that define gender unambiguously: he and she, man and woman, king and queen, brother and sister, and so on. They computed the difference vector for each pair. Then they ran principal component analysis on those ten difference vectors and extracted the first component. That single direction, one line through three hundred dimensional space, captured the gender axis. Every word in the vocabulary had a position along this axis. "Receptionist" leaned strongly toward the female end. "Surgeon" leaned toward the male end. "Homemaker" was female. "Philosopher" was male. The biases were not scattered randomly. They were organized along a single geometric direction, as cleanly as the relationship between kings and queens.
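
A sketch of that construction, assuming word vectors are already loaded into a dict; the pair list in the usage comment is an illustrative subset, not the paper's exact ten.

```python
import numpy as np

def gender_direction(emb: dict, pairs: list) -> np.ndarray:
    """First principal component of the pair difference vectors.

    Bolukbasi et al. ran PCA on ten definitional pairs; the pairs in the
    usage comment below are an illustrative subset.
    """
    diffs = np.stack([emb[f] - emb[m] for f, m in pairs])
    diffs -= diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)  # PCA via SVD
    return vt[0]                                          # the gender axis

# pairs = [("she", "he"), ("woman", "man"), ("queen", "king"),
#          ("sister", "brother"), ("daughter", "son")]
# g = gender_direction(vectors, pairs)
# lean = vectors["receptionist"] @ g   # sign and size = position on the axis
```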

The team tried to fix it. Their approach, called hard debiasing, projected gender-neutral words onto the subspace perpendicular to the gender direction, stripping away the gender component from words that should not have one. It worked, on the surface. Before debiasing, nineteen percent of the top one hundred fifty analogies generated by the model were judged stereotypical by crowd workers. After debiasing, only six percent were.
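
The projection itself is one line of linear algebra. A sketch, assuming `g` is the unit-length gender direction computed above:

```python
import numpy as np

def hard_debias(vec: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Project vec onto the subspace perpendicular to the gender direction g.

    This is the core step of hard debiasing: subtract the component of the
    word vector that lies along g, then renormalize.
    """
    g = g / np.linalg.norm(g)
    debiased = vec - (vec @ g) * g      # strip the gender component
    return debiased / np.linalg.norm(debiased)

# w = hard_debias(vectors["surgeon"], g)
# w @ g is now numerically zero: no lean along the gender axis
```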

But in twenty nineteen, Hila Gonen and Yoav Goldberg published a paper with the blunt title "Lipstick on a Pig." They demonstrated that the debiased embeddings still carried recoverable bias. Removing the gender direction did not remove the underlying structure. The bias was woven through multiple dimensions, and flattening one axis left enough signal in the others that a simple classifier could still detect gender associations with high accuracy. The metaphor in the title was precise. You could put lipstick on the pig, but it was still a pig underneath.

The researchers had expected that professional journalism would show weaker biases than informal web text. The opposite was true. The stereotypes in news articles were remarkably precise and remarkably consistent.

One Hundred Years of Stereotypes

There is a sequel to the bias story that the main episode pointed at but did not enter. In twenty eighteen, Nikhil Garg and colleagues at Stanford and Princeton published a paper in the Proceedings of the National Academy of Sciences that turned embeddings into a measuring instrument for social change. They trained separate embedding models on text from each decade of the twentieth century, using the Google Books corpus and the Corpus of Historical American English, and then tracked how word associations shifted over time.

The results were a quantitative history of American prejudice. In the nineteen twenties, the words most closely associated with Asian Americans in the embedding space were "barbaric," "hateful," "monstrous," and "cruel." By the nineteen fifties, the associations had shifted to "disorganized," "pompous," and "unstable." By nineteen ninety, the cluster had moved again to "passive," "complacent," and "sensitive," the model minority stereotype replacing the earlier dehumanization. The embedding space was not just reflecting bias. It was dating it.

For gender, the most dramatic shift appeared in the nineteen sixties and seventies, coinciding with the women's movement. Competence adjectives like "intelligent," "logical," and "thoughtful" increased their association with women from nineteen sixty to nineteen ninety, a trend the researchers projected would reach gender parity sometime after twenty twenty. The word "hysterical," which sat in the top five words associated with women in the nineteen twenties, had dropped outside the top hundred by nineteen ninety.

The correlation between embedding bias and reality was startling. When the researchers compared the gender associations in Google News embeddings against actual United States Census workforce composition, the correlation was statistically significant with a p-value below one in ten billion. The embedding space was not merely reflecting stereotypes. It was reflecting the actual distribution of who held which jobs in American society, with enough fidelity to serve as a census proxy. The authors argued that embeddings trained on historical text could serve as what they called a quantitative lens for measuring how societies change. The map, it turned out, was also a time machine.

When Context Changes Everything

The main episode described embeddings as positions in space. One word, one position. That is how Word2Vec and its contemporary GloVe worked, and it was the state of the art from twenty thirteen to twenty eighteen. But it had an obvious problem. The word "bank" in "I deposited money at the bank" and "bank" in "I sat on the river bank" got the same embedding. One position for two completely different meanings. Static embeddings could not handle polysemy, the linguistic term for words with multiple meanings, because they assigned exactly one point in space to each word regardless of context.

In February twenty eighteen, a team at the Allen Institute for Artificial Intelligence published a paper that broke this limitation. The lead author was Matthew Peters, a researcher with an unusual background. He had a PhD in applied mathematics and had spent years in finance building mortgage models for banks, the financial kind, before joining AI2 in Seattle. His model was called ELMo, Embeddings from Language Models, named after the Sesame Street character because the researchers liked the name's whimsical flavor. ELMo ran two neural networks simultaneously over the text, one reading left to right and one reading right to left, and combined their outputs to produce an embedding for each word that depended on the entire sentence around it. "Bank" near "money" got a different embedding than "bank" near "river." The same word in different contexts now lived at different points in space.
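
You can see the effect directly with any contextual model. ELMo itself is awkward to run today, so this sketch substitutes BERT via the Hugging Face transformers library; the principle is the same.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Contextual embedding of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

money_bank = embedding_of("I deposited money at the bank.", "bank")
river_bank = embedding_of("I sat on the river bank.", "bank")
print(torch.cosine_similarity(money_bank, river_bank, dim=0))  # well below 1.0
```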

The word embeddings should be a function of the entire input sentence. A word is not a fixed point. It is a function of its context.

ELMo was historically important, the first widely successful contextual embedding model. But it was almost immediately overshadowed. Eight months later, in October twenty eighteen, Jacob Devlin and his team at Google published BERT, Bidirectional Encoder Representations from Transformers, named after another Sesame Street character as a deliberate nod to ELMo. What had started as a whimsical naming choice became an entire tradition. After BERT came Big Bird from Google, ERNIE from Baidu, RoBERTa from Facebook, and a supporting cast including Kermit and Grover. OpenAI's GPT-2 was almost called Snuffleupagus before someone decided it was not serious enough.

BERT's innovation was the masked language model, and its inspiration came from an unexpected place. In nineteen fifty-three, a psychologist named Wilson Taylor had developed something called the cloze test, from the Gestalt psychology principle of closure, the human tendency to fill in gaps. Taylor would delete random words from a passage and measure how easily readers could fill them back in as a way to test readability. Devlin borrowed the idea wholesale. BERT randomly masks fifteen percent of the tokens in its input and trains to predict the originals from both the left and right context simultaneously. This deeply bidirectional approach, reading in both directions at once rather than left to right like previous models, produced embeddings that captured meaning with unprecedented accuracy.
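
A sketch of the masking step, simplified: real BERT replaces eighty percent of the chosen tokens with a mask token, ten percent with a random token, and leaves ten percent unchanged, while this version masks every pick.

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(tokens, rng=random):
    """Cloze-style masking: hide roughly 15% of tokens, remember the originals."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_RATE:
            targets[i] = tok        # the model trains to predict these
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens))
# e.g. (['the', 'quick', '[MASK]', 'fox', ...], {2: 'brown'})
```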

BERT came in two sizes. The base model used seven hundred sixty-eight dimensional embeddings, derived from twelve attention heads each operating in sixty-four dimensions, a convention inherited from the original Transformer paper. The large model used one thousand twenty-four dimensions. Both were trained on the Toronto BookCorpus and English Wikipedia, roughly three point three billion words total. BERT set new records on eleven natural language processing tasks simultaneously and accumulated over thirty thousand citations.

The Search Engine That Finally Understood Prepositions

In October twenty nineteen, Google deployed BERT into its search engine. The announcement came from Pandu Nayak, Google's Vice President of Search.

This is the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of search.

Nayak gave a specific example that illustrated why contextual embeddings mattered so much for search. Before BERT, if you searched for "twenty nineteen brazil traveler to usa need a visa," Google's algorithm struggled with the word "to." It could not determine whether the searcher was a Brazilian going to the United States or an American going to Brazil. The word "to" has no fixed meaning in a static embedding. It is entirely defined by its context. BERT, reading the full sentence bidirectionally, understood that "to" indicated direction of travel and returned results about Brazilian citizens visiting the United States. The change affected roughly ten percent of English language searches. With billions of searches happening every day, that meant hundreds of millions of results changed overnight, all because the search engine could now read prepositions in context instead of treating every word as a fixed point in space.

Sixteen Thousand Dimensions and Counting

The numbers tell a story of relentless expansion. Word2Vec in twenty thirteen used three hundred dimensions. That number was not derived from theory. Mikolov and his team tried different sizes and found that three hundred was sufficient to capture the nuance in their training data without wasting computation. BERT in twenty eighteen used seven hundred sixty-eight dimensions, a number that fell out of its architecture: twelve attention heads times sixty-four dimensions per head, where sixty-four was the per-head convention established by the original Transformer paper. GPT-3 in twenty twenty jumped to twelve thousand two hundred eighty-eight dimensions. Current models like GPT-4 likely use something close to sixteen thousand. Every transformer-era width divides evenly by the number of attention heads, an architectural constraint from the way attention and parallel computation split the work.

What do those extra dimensions buy? Not interpretability. Nobody can look at dimension seven thousand four hundred and twelve and say "this encodes formality" or "this captures whether the word relates to food." The dimensions are what researchers politely call latent representations, which is a formal way of saying "we do not know what most of them mean individually." Some dimensions correlate with identifiable properties. Researchers have found dimensions that track sentiment, word length, and national population. Grammar has geometric reality: consistent vector offsets for pluralization and verb tense show up as directions in the space regardless of how many dimensions you have. But the space is fundamentally a compressed tangle of statistical patterns, not a tidy spreadsheet with labeled columns.

What more dimensions definitely buy is capacity. Three hundred dimensions can encode that "happy" and "joyful" are similar and that "Paris" is to "France" as "Berlin" is to "Germany." Sixteen thousand dimensions can simultaneously encode those relationships and thousands of subtler ones: that a word carries slightly different connotations in legal versus medical contexts, that a phrase is more formal in British English than American, that a technical term shifted meaning between twenty fifteen and twenty twenty-three. Each dimension is one more axis along which the model can distinguish things that matter from things that do not.

There is a recent innovation that makes the dimension question less binary. Matryoshka representation learning, named after Russian nesting dolls, trains embeddings so that the most important information is packed into the first dimensions. You can truncate half the dimensions and lose surprisingly little performance on retrieval tasks. It is a pragmatic solution to a real problem: storing and searching sixteen thousand dimensional vectors for billions of documents is expensive, and most applications do not need the full precision of the largest models.
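
In practice the truncation is just slicing, provided the model was trained the Matryoshka way. A sketch, with illustrative dimension counts:

```python
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates and renormalize.

    Only sensible when the model was trained with Matryoshka representation
    learning, which packs the most important information into the leading
    dimensions; truncating an ordinary embedding discards arbitrary signal.
    """
    short = vec[:dims]
    return short / np.linalg.norm(short)

# full = a 1536-dimensional embedding from a Matryoshka-trained model (illustrative size)
# cheap = truncate(full, 256)   # six times less storage and faster search
# Cosine similarities computed on `cheap` track those on `full` closely.
```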

The Billion-Dollar Geometry Problem

The main episode mentioned retrieval-augmented generation and vector databases in passing. Here is how the machinery actually works, because an industry worth over one point seven billion dollars in twenty twenty-four rests on a surprisingly simple idea.

The pipeline has two phases. In the offline phase, you take your documents, whatever the chatbot needs to know about, and chop them into chunks. Each chunk gets converted into an embedding vector using an embedding model. Those vectors get stored in a specialized database. That is the indexing step. In the query phase, the user asks a question. That question gets converted into a vector using the same embedding model. The database searches for the stored vectors nearest to the question vector, usually by cosine similarity, which measures the angle between two vectors in the space. The nearest chunks get retrieved and stuffed into the prompt alongside the original question. The language model generates its answer grounded in the retrieved material rather than in its training data alone.
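
The whole pipeline fits in a page. A minimal sketch, with a toy `embed` function standing in for whatever embedding model you would actually call, and deliberately naive chunking:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model; swap in any API that returns a vector.

    Hashes words into a bag-of-words vector just so the pipeline runs end to end.
    """
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def chunk(document: str, size: int = 500) -> list[str]:
    """Offline phase, step one: naive fixed-size character chunks."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_index(documents: list[str]):
    """Offline phase, step two: embed every chunk and stack the vectors."""
    chunks = [c for doc in documents for c in chunk(doc)]
    return chunks, np.stack([embed(c) for c in chunks])

def retrieve(question: str, chunks, vectors, k: int = 3) -> list[str]:
    """Query phase: embed the question, return the k nearest chunks by cosine."""
    q = embed(question)
    sims = vectors @ q          # cosine similarity, since all vectors are unit length
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

chunks, vectors = build_index(["Visas for Brazilian travelers...", "River bank erosion..."])
context = retrieve("do brazil travelers need a visa", chunks, vectors)
# prompt = f"Context: {context}\n\nQuestion: do brazil travelers need a visa"
```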

I realized the gap was not in training models. Everyone was focused on training. The gap was in working with the vectors those models produced, storing them, searching them, making them useful at scale.

Edo Liberty understood this before most people. He had been Director of Research at Amazon Web Services and head of Amazon AI Labs, where he had created SageMaker, Amazon's machine learning platform. In twenty nineteen, he founded Pinecone, a managed vector database, on the bet that the bottleneck was not the models but the infrastructure for using their outputs. Pinecone raised ten million in seed funding, twenty-eight million in Series A, and one hundred million in Series B from Andreessen Horowitz at a seven hundred fifty million dollar valuation. The open source alternatives followed: Milvus, with roughly twenty-five thousand stars on GitHub. Weaviate, pulling over a million Docker downloads per month. Qdrant, built in Rust for performance. Even PostgreSQL got a vector extension called pgvector, because sometimes the most powerful move is adding a feature to the database you already have.

By twenty twenty-five, the market had begun to consolidate. Vector search became a checkbox feature in major cloud platforms rather than a standalone product. The pure-play vector database startups faced a classic infrastructure problem: the capability they pioneered was being absorbed into general-purpose tools. But the underlying insight, that meaning has coordinates and finding relevant information is a geometry problem, had become permanent infrastructure.

When Meaning Crosses Borders

One of the most surprising properties of embedding spaces is that they look similar across languages. Train embeddings on English text and separately on Turkish text, and the resulting spaces have similar shapes. Not identical, but similar enough that you can learn a mapping from one to the other. In twenty eighteen, researchers at Facebook AI Research, now Meta, published a technique that made this practical. They trained separate language-specific embeddings using fastText, Mikolov's successor to Word2Vec that represents words as bags of character fragments. Then they used a small bilingual dictionary, a few thousand word pairs, to learn a rotation matrix that aligned the two spaces. Turkish "futbol" and English "soccer" ended up near each other. French "maison" and English "house" became neighbors.
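
The alignment step is a classic piece of linear algebra known as the orthogonal Procrustes problem. A sketch, assuming two matrices of dictionary-pair embeddings whose rows correspond:

```python
import numpy as np

def align(src, tgt):
    """Orthogonal Procrustes: find the rotation W minimizing ||src @ W - tgt||.

    src and tgt are (n, d) matrices of embeddings for n dictionary pairs,
    row i of src being a word whose translation is row i of tgt.
    Closed-form solution via SVD; W is orthogonal, so distances in the
    source space are preserved.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# W = align(turkish_dict_vecs, english_dict_vecs)   # a few thousand pairs suffice
# All Turkish vectors can now be mapped into the English space:
# nearest English neighbor of vectors_tr["futbol"] @ W should be "soccer"
```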

The results were striking. The aligned embeddings achieved ninety-five percent accuracy on languages they had never seen during training, with twenty to thirty times the speed of translating first and then classifying. The strongest predictors of whether two languages would align well turned out to be word order agreement and morphological similarity, which makes intuitive sense. Languages that build sentences in similar ways produce embedding spaces with similar geometries.

The implication goes beyond translation. If the spaces are alignable, it suggests something about the structure of meaning itself. Maybe the geometry is not an artifact of English or Turkish specifically but reflects something about how concepts relate to each other regardless of the language used to express them. That is a strong claim, and the evidence does not fully support it yet: languages with very different structures and small training corpora still align poorly. But the fact that it works as well as it does for dozens of language pairs is one of the most philosophically interesting results in the field.

The Othello Debate

In twenty twenty-three, Kenneth Li and colleagues published an experiment that reignited one of the oldest debates in artificial intelligence: does a model that predicts sequences actually understand anything about the world those sequences describe?

They trained a small GPT-style model to predict legal moves in the board game Othello. The model saw only move sequences, lists of positions played in order, with a vocabulary of just sixty tokens representing the sixty playable squares. It was never shown a board. It was never told the rules. It simply learned to predict what move would come next, the same way a language model predicts the next word.

Then the researchers probed the model's internal representations, its embeddings, to see what it had learned. Using nonlinear probes, small neural networks trained to read the model's internal state, they could recover the full board position with only one point seven percent error. The model had developed something that looked like an internal representation of the game board, a world model, despite never being shown one. When they tested a randomly initialized model with no training as a control, the probe error jumped to twenty-six percent, confirming that the representations were learned, not incidental.

But Neel Nanda and colleagues pushed back on the strongest interpretation. The original paper had found that linear probes, simpler classifiers that can only look for linear patterns, performed near random. This seemed to suggest the representation was complex and nonlinear, more like a genuine internal model than a statistical shortcut. Nanda showed that linear probes worked fine if you asked the right question. Instead of probing for absolute piece colors, black or white, you probed for relative ownership, my piece or their piece. With that reframing, linear probes achieved error rates below ten percent. The world model was real, but it was organized differently than the researchers initially assumed.
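
For concreteness, here is what a linear probe amounts to in this setting: a logistic regression trained to read one fact off frozen activations. The data below is synthetic so the sketch runs; the shapes and the mine-versus-theirs encoding are the only things borrowed from the real experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: hidden_states[i] would be the Othello model's activation
# after move i; labels[i] is one square's state in the relative encoding
# (0 = empty, 1 = mine, i.e. the current player's, 2 = theirs).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2_000, 128))
labels = rng.integers(0, 3, size=2_000)

# A linear probe is just a linear classifier on frozen activations. If it reads
# the square's state accurately, that fact is linearly encoded in the embedding.
probe = LogisticRegression(max_iter=1000)
probe.fit(hidden_states[:1_500], labels[:1_500])
print("probe accuracy:", probe.score(hidden_states[1_500:], labels[1_500:]))
# Near chance (~0.33) on this random data; on real activations with the
# mine-versus-theirs encoding, error rates dropped below ten percent.
```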

The interventional experiments were the most compelling evidence. Researchers could reach into the model's hidden states, alter the representation of a specific board square, and the model's predicted moves would change accordingly. The representation was not just correlated with the board state. It was causally linked to the model's behavior. Change the map and the navigator changes course.

This matters far beyond Othello. If a model trained only to predict the next token in a sequence develops internal representations of the underlying structure that generates those sequences, then the question of whether large language models "understand" anything becomes much harder to dismiss. The representation is there. It is causally active. But whether "a model with a recoverable board state" is the same as "a model that understands Othello" remains genuinely contested. The embedding space holds the evidence, but the jury is still deliberating.

The Bitter Footnote

Tomas Mikolov, the person most responsible for bringing embeddings into the mainstream, has a complicated relationship with the field he helped create. In the main episode, you heard about the moment of discovery, the colleague who called his idea silly. Here is the aftermath.

The Word2Vec paper was rejected from the main conference at ICLR twenty thirteen. The acceptance rate that year was approximately seventy percent, meaning nearly three quarters of submitted papers were accepted. Word2Vec was not among them.

Word2Vec paper got rejected, although today, it is probably more cited than all the accepted papers at ICLR twenty thirteen together.

When the Word2Vec code was ready for release, Google initially refused to approve it. Senior colleagues told Mikolov to stop trying. He had to find allies at Google Brain with enough organizational leverage to bypass the blockade. Google eventually approved the open source release around August twenty thirteen.

Once the code was open sourced, the interest skyrocketed.

But the code itself was notoriously difficult to read. Mikolov admitted he had over-optimized it during the months he spent waiting for Google's approval, trying to make it simultaneously faster and shorter. Not intentional obfuscation, he said, just a bored engineer making things more efficient while bureaucracy ground forward.

Mikolov left Google for Facebook AI Research in twenty fourteen and later returned to the Czech Republic, taking a position at the Czech Institute of Informatics, Robotics and Cybernetics in Prague in twenty twenty. When his paper won the NeurIPS Test of Time Award in December twenty twenty-three, a decade after publication, the acceptance speech was given by Jeff Dean and Greg Corrado. Mikolov's response on Hacker News was bitter. He reflected on how difficult it is for reviewers to predict the future impact of research papers. He argued that the competing GloVe model from Stanford was slower, required more memory, and gained popularity mainly through an unfair comparison that used more training data. He called his earlier recurrent neural network language model project equally revolutionary as AlexNet. And he concluded with the words, "money and power certainly corrupts people."

It is a familiar pattern in computer science. The person who builds the thing and the person who gets the credit are not always the same person. Mikolov did not invent the distributional hypothesis. He did not invent neural language models. He did not invent embedding spaces. But he built the tool that proved all of these ideas worked at scale, in a form simple enough that anyone could use it, and that tool changed the trajectory of the entire field. Whether the field has adequately acknowledged that is a question the embeddings themselves cannot answer.

The Jargon Jar

This episode's term: vector.

The version you would text to a friend: a list of numbers that represents something. A word gets turned into three hundred or sixteen thousand numbers, and those numbers place it in a space where similar things are nearby. That is a vector: a position in the space, or equivalently an arrow pointing from the origin to that position.

How marketing uses it: "Our AI-powered vector search delivers semantic understanding of your data." Translation: we compute cosine similarity on arrays of floats.

What it actually means in practice: a vector is the format in which modern AI systems store and compare meaning. Every word, sentence, image, and song your AI tools process gets converted into a vector at some point in the pipeline. When someone says "vector database" or "vector search," they mean a system optimized for finding the vectors closest to a given query vector, which is another way of saying "finding the most similar things." The entire concept rests on the insight this episode explored: that similarity of meaning can be measured as proximity in a geometric space. The vector is the coordinate. The distance is the meaning.
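
That marketing translation is nearly literal. The operation underneath every semantic search product, in full:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity as the angle between two vectors: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```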

That was the deep dive for episode six.