Actually, AI
Attention: Eight People and a Joke Title
S1 E4 · 12m · Apr 04, 2026
In 2017, eight Google researchers published "Attention Is All You Need"—a paper titled like a Beatles joke that revolutionized AI by letting machines read your entire message at once instead of forgetting earlier words.


The Ticker Tape

This is episode four of Actually, AI. If you want the full deep dive into the paper and every author, that companion episode is right after this one in your feed.

For most of the history of artificial intelligence, language was a conveyor belt. A model read a sentence the way an old stock ticker printed prices, one symbol at a time, left to right. Each word arrived, got processed, and was compressed into a running summary of everything that had come before. By the time the machine reached the end of a long sentence, the beginning was a faded echo, squeezed through a bottleneck that lost more with every step. This was the best anyone had. Recurrent neural networks, they were called. They worked well enough for short sentences. For anything longer, they quietly forgot.

In twenty seventeen, eight researchers at Google published a paper that threw out the conveyor belt entirely. The title was "Attention Is All You Need," a riff on the Beatles song "All You Need Is Love." It read like a joke. The contents were not a joke. That paper replaced the dominant architecture of the field, and every major AI system you interact with today, ChatGPT, Claude, Gemini, Llama, is a direct descendant of what those eight people built.

The part that makes this story strange is that none of them were trying to change the world. They were trying to make Google Translate a little faster.

The Canteen, the Hallway, and the Intern

The eight authors were not a research team in the usual sense. They were scattered across Google Brain and Google Research, working on different projects, and the paper came together the way a lot of breakthroughs do, through lunchtime conversations and hallway encounters that turned into something nobody planned.

The first spark came from Jakob Uszkoreit, a researcher whose father, Hans Uszkoreit, was one of Germany's most prominent computational linguists and the scientific director at the German Research Center for Artificial Intelligence. Growing up around language research gave Jakob an unusual perspective. In twenty sixteen, he started thinking about whether self-attention, a technique that let a model look at relationships within a sequence instead of processing it step by step, could replace recurrent networks entirely. His father was skeptical. The conventional wisdom was skeptical. Jakob pushed ahead anyway.

In early twenty seventeen, Uszkoreit, Ashish Vaswani, and a Ukrainian researcher named Illia Polosukhin sat down in the Google canteen to hash out the idea. When lunch ended, Polosukhin went back to his desk and built what may have been the very first transformer prototype. Meanwhile, Noam Shazeer heard colleagues in the hallway talking about replacing LSTMs with attention.

I heard a few of my colleagues in the hallway saying, let us replace LSTMs with attention. I said, heck yeah!

Shazeer was not a typical researcher. He had joined Google in two thousand, just two years after the company was founded. One of his first contributions was dramatically improving Google's "Did you mean" spelling corrector. He later built a language model called PHIL whose technology Jeff Dean used to quickly implement AdSense, bringing billions of dollars in new revenue. He had won a gold medal with a perfect score at the International Mathematical Olympiad in nineteen ninety four. When he turned his attention to the transformer project, he designed the specific multi-head attention mechanism that became the paper's beating heart.

The youngest of the eight was Aidan Gomez, a twenty year old undergraduate intern from the University of Toronto. He had come to Google Brain's Toronto office to work with Lukasz Kaiser, a Polish researcher who had traded a tenured position in automata theory at the University of Paris for a spot at the nascent lab. Together, Kaiser and Gomez built the Tensor2Tensor framework that made the transformer experiments reproducible. Gomez ran hundreds of configurations to find the parameters that worked.

Niki Parmar, the only woman on the team, had joined Google at twenty four, one of the youngest members and working without a PhD. She designed, implemented, and tuned countless model variants, testing what worked and what did not. Vaswani, born in Nagpur, India, had spent years studying machine translation at the University of Southern California before arriving at Google Brain. Llion Jones, a Welshman who would later move to Google's Tokyo office, handled efficient inference and the visualizations that helped the team understand what their creation was doing.

It was an intense three-month sprint to the NeurIPS paper deadline. On the night before submission, Gomez and Vaswani stayed overnight at the Google office to get it in. The paper appeared on arXiv on June twelfth, twenty seventeen. The author order was randomized. A footnote stated, "Equal contribution. Listing order is random."

I would be lying if I said I had any appreciation for what was to come. Those of us close to the metal were very focused on building something that was very good at translation.

The Machine That Looks Everywhere at Once

So what did they actually build? The core idea is attention, and it works nothing like the conveyor belt it replaced.

Imagine you are reading the sentence "The cat sat on the mat because it was tired." When you hit the word "it," your brain instantly knows it refers to "the cat." You do not need to replay the sentence from the beginning. You just look back and find the relevant word. That is attention. The model does something similar, except it does it for every word simultaneously, checking every other word for relevance.

Here is the mechanism, stripped to its bones. Each word in the sentence gets converted into three things. A query, which you can think of as the question "what am I looking for?" A key, which is the label "here is what I contain." And a value, which is the actual information. The model compares every query against every key, scores how well they match, and then blends the values based on those scores. Words that are relevant to each other get strong connections. Words that are not relevant get weak ones.
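
To make that concrete, here is a minimal sketch in Python with toy numbers. The function and the values are ours for illustration, not the paper's actual implementation; real models use learned weight matrices and run many of these attention "heads" in parallel.

    import numpy as np

    def attention(Q, K, V):
        # Score every query against every key: how relevant is each word to each other word?
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Turn each row of scores into weights that sum to one
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        # Blend the values according to those weights
        return weights @ V

    # Four "words," each represented by a vector of three numbers
    Q = np.random.randn(4, 3)   # what each word is looking for
    K = np.random.randn(4, 3)   # the label each word advertises
    V = np.random.randn(4, 3)   # the information each word carries
    print(attention(Q, K, V).shape)   # (4, 3): every word ends up with a blend of the whole sentence

In the cat sentence, the query for "it" would score highly against the key for "cat," so the blended result for "it" is dominated by the cat's value.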

The old recurrent approach was like passing a note down a long line of people. By the time it reached the end, the message was garbled. Attention is like everyone in the room being able to talk to everyone else at the same time. The signal does not degrade over distance. A word at the beginning of a paragraph is just as accessible as the word right next to you.

This has an enormous practical consequence. Because every word can attend to every other word in parallel, the entire computation can happen at once instead of step by step. Graphics processors, which are designed to do thousands of calculations simultaneously, are suddenly a perfect match for the architecture. The old recurrent networks were bottlenecked by their own sequentiality. Transformers removed the bottleneck. Training times dropped. Models grew.

The analogy breaks down in one important place. Attention is not actually "looking." There is no understanding behind the matching. The model is computing numerical similarity scores between high-dimensional vectors. The result looks like comprehension, but it is matrix multiplication. Whether that distinction matters philosophically is one of those questions this series keeps running into.
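
A rough way to see the difference, as a sketch rather than either architecture in full:

    import numpy as np

    words = np.random.randn(6, 3)      # a six-word "sentence," three numbers per word

    # Recurrent style: one step at a time; each step has to wait for the one before it
    state = np.zeros(3)
    for w in words:
        state = np.tanh(state + w)     # everything seen so far squeezed into one running summary

    # Attention style: every word compared with every other word in a single matrix product
    scores = words @ words.T           # a 6-by-6 table of similarity scores, computed all at once

The update rule in the loop is made up for illustration; the point is the shape of the computation. The loop cannot be spread across words, while the matrix product can be handed to a GPU whole.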

The Descendants

The paper was about machine translation. It got state-of-the-art scores on English-to-German and English-to-French benchmarks. Within a year, it had reshaped the entire field.

The transformer architecture turned out to be unreasonably general. It worked for language generation. It worked for image recognition, after another team, which included Uszkoreit himself, showed that you could chop an image into patches and feed them to a transformer the same way you feed in words. It worked for protein folding. It worked for music composition, weather prediction, drug design. The architecture was not specific to language. It was a general-purpose pattern-matching engine that got better and better the more data and compute you threw at it.

That scalability, that property of getting disproportionately smarter with size, is the foundation of the scaling story we will explore in episode seven. Previous architectures improved with scale, but transformers improved with a consistency and steepness that nobody expected. The race to build bigger and bigger models, the hundred-billion-parameter systems, the trillion-token training runs, all of that became possible because the transformer could absorb the resources efficiently.

And the cost of attention is why your context window has limits. Because every word checks every other word, the cost grows with the square of the sequence length. Doubling the context does not double the work. It quadruples it. That is the bottleneck behind the engineering war we will cover in episode ten.
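
The arithmetic is easy to check. Every word scores itself against every other word, so the number of comparisons is the sequence length squared:

    # Back-of-the-envelope cost of attention: comparisons grow with the square of length
    for n in (1_000, 2_000, 4_000):
        print(f"{n} words -> {n * n:,} query-key comparisons")
    # 1000 -> 1,000,000; 2000 -> 4,000,000; 4000 -> 16,000,000

Real systems layer on batching, multiple heads, and various efficiency tricks, but that quadratic core is why long context is expensive.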

Every major AI system in the world runs on transformers. The paper has been cited over a hundred and seventy three thousand times. At the NVIDIA GTC conference in twenty twenty four, Jensen Huang gathered seven of the eight authors on stage and presented each with a framed cover plate from an NVIDIA supercomputer. The inscription read, "You transformed the world."

The Diaspora

Here is the part of the story that nobody could have predicted. All eight authors eventually left Google. Seven of them founded companies. Vaswani and Parmar co-founded two startups before Parmar joined Anthropic. Shazeer built Character dot AI, left in a two point seven billion dollar deal, and returned to Google to lead work on Gemini. Uszkoreit used transformer architecture to design novel RNA molecules for vaccines. Gomez co-founded Cohere, now valued at nearly seven billion dollars. Jones moved to Tokyo and built Sakana AI, Japan's most valuable unicorn, where his stated mission is to move beyond the very architecture he helped create. Polosukhin pivoted from AI to blockchain, co-founding NEAR Protocol. Kaiser joined OpenAI and co-invented the reasoning models behind their latest systems.

I am absolutely sick of transformers.

That was Llion Jones, at the TED AI conference. A co-inventor of the most successful architecture in the history of artificial intelligence, calling for the field to move past it. His argument is that despite unprecedented investment and talent, the dominance of a single architecture has narrowed the range of research being done.

Maybe he is right. Maybe the next breakthrough will come from something entirely different. But for now, the transformer remains the foundation underneath nearly everything. Attention, the mechanism, and the Transformer, the architecture built around it, are the reason you can talk to an AI and have it talk back in a way that feels like it understands you. Whether it does understand, well. That is a different episode.

That was episode four. The deep dive goes further into the actual paper, all eight authors and what happened to them, the mechanics of queries and keys and values, and the engineering trick that made long conversations possible. Find it right after this in your feed.