This is the deep dive companion to episode four of Actually, AI: attention. The main story covered the people and the core idea. This episode goes into the paper itself, the mechanics, the problems, and the remarkable afterlives of the eight people who wrote it.
We need to start three years before the transformer, in Montreal, with a Belarusian graduate student who was deeply skeptical of his own assignment.
In twenty fourteen, Dzmitry Bahdanau was working at Mila, Yoshua Bengio's machine learning lab, on a problem that annoyed him. The dominant approach to machine translation at the time used an encoder-decoder architecture. The encoder read an entire sentence in one language and compressed it into a single fixed-length vector. The decoder then tried to reconstruct the translation from that one vector. Bahdanau thought this was absurd.
I was super skeptical about the idea of cramming a sequence of words in a vector, but I also really wanted a PhD offer.
The skepticism was productive. Bahdanau had spent thousands of hours in competitive programming contests under the handle "rizar," and his instinct was to find the elegant solution. When you translate a sentence, he reasoned, your eyes shift back and forth between the source and target. You do not stare at one fixed point. Why should a model be forced to compress everything into a single vector when it could learn where to look?
His first attempts were too complicated. Two cursors moving through sequences with dynamic programming. Hard-coded diagonal attention patterns. None of them had the right feel. Then the breakthrough. Instead of telling the model where to look, let it learn where to look. Let the decoder compute a score for each encoder position, turn those scores into a probability distribution, and use the distribution to create a weighted blend of the encoder states. A soft search across the entire source sentence.
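If you want to see the shape of that idea in code, here is a minimal numpy sketch of the soft search, in the additive style of the twenty fourteen paper: score every encoder state against the decoder state, softmax the scores, blend. The dimensions and random weights here are invented for illustration; the real model learns them inside a recurrent decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes, purely illustrative.
src_len, enc_dim, dec_dim, att_dim = 6, 8, 8, 4
rng = np.random.default_rng(0)

H = rng.normal(size=(src_len, enc_dim))   # encoder states, one per source word
s = rng.normal(size=dec_dim)              # current decoder state

# The scoring network's parameters (learned in the real model, random here).
W_enc = rng.normal(size=(att_dim, enc_dim))
W_dec = rng.normal(size=(att_dim, dec_dim))
v = rng.normal(size=att_dim)

# Score every encoder position against the decoder state...
scores = np.array([v @ np.tanh(W_enc @ h + W_dec @ s) for h in H])
weights = softmax(scores)   # ...turn the scores into a distribution...
context = weights @ H       # ...and take a weighted blend of the states.

print(weights.round(3))     # sums to 1: where the model "looks"
```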
It worked on the first try. Bengio added the name "attention" to the conclusion in one of the final editing passes. The paper, "Neural Machine Translation by Jointly Learning to Align and Translate," appeared on arXiv in September twenty fourteen. The team published fast because they knew Google researchers were working on something similar.
This was a significant step, but it was not the transformer. Bahdanau attention was cross-attention, the decoder attending to the encoder. The backbone was still a recurrent neural network, processing tokens one at a time. The attention mechanism was a helpful addition to recurrence, not a replacement for it. That replacement would take three more years.
"Attention Is All You Need" appeared on arXiv on June twelfth, twenty seventeen. Fifteen pages including references. The claim was radical and stated plainly in the abstract. A new architecture, based entirely on attention mechanisms, with no recurrence and no convolutions. The authors called it the Transformer.
The title, as covered in the main episode, was a Beatles joke. The name "Transformer" was Jakob Uszkoreit's choice. He picked it because he liked how the word sounded, with a nostalgic nod to the cartoon robot franchise from his childhood. An early internal design document was literally titled "Transformers: Iterative Self-Attention and Processing for Various Tasks" and included an illustration of six characters from the Transformers cartoon.
The paper's core innovation was self-attention. In Bahdanau's work, the decoder attended to the encoder, one sequence looking at another. In the transformer, every position in a sequence attends to every other position within the same sequence. This is the leap. The model does not just look at source material to produce a translation. It looks at itself, building a rich representation of the input by letting every word consider every other word simultaneously.
The paper was tested on machine translation. It scored twenty eight point four BLEU on English-to-German, surpassing the previous best by more than two points, and forty one point eight on English-to-French. But the real victory was speed. The authors reported training the base model in twelve hours on eight NVIDIA P one hundred GPUs. Comparable recurrent models took days.
The NeurIPS reviews, which are publicly available, were positive but grounded. The first reviewer gave a clear accept, noting significant community interest and replication efforts already underway. The second praised the entirely novel architecture but wanted better mathematical definitions. The third hoped for a more in-depth expanded version.
The combination of these techniques and the details necessary for getting it to work as well as LSTMs is a major achievement.
The paper itself lists what each author contributed, which is unusual for a machine learning paper and gives us a remarkably clear picture. Uszkoreit proposed replacing recurrent networks with self-attention and started the effort to evaluate the idea. Vaswani and Polosukhin designed and implemented the first transformer models. Shazeer proposed scaled dot-product attention, multi-head attention, and the parameter-free position representation, and was involved in nearly every detail. Parmar designed, implemented, and tuned countless model variants. Jones experimented with novel variants and handled inference. Kaiser and Gomez built the Tensor2Tensor framework that made the research reproducible and greatly improved results.
The core mechanism deserves a careful explanation.
Every token in a sequence gets transformed into three vectors. A query, a key, and a value. The terminology comes from information retrieval. Think of a library. You walk in with a question, your query. Every book on the shelf has a label, its key. You compare your question against every label to figure out which books are relevant. Then you read the contents of the relevant books, the values, and blend the information based on how relevant each one was.
In the transformer, every token is simultaneously asking a question and being a book on the shelf. Token five's query gets compared against the keys of tokens one through fifty, or however long the sequence is. The comparison is a dot product, a simple multiplication that produces a single number measuring how similar two vectors are. High score means relevant. Low score means irrelevant. These scores get turned into a probability distribution through the softmax function, so they add up to one. Then the values get weighted by those probabilities and summed. The result is a new representation of token five that incorporates information from every other token in proportion to their relevance.
Noam Shazeer added a critical refinement. In high-dimensional space, dot products can grow very large, which pushes the softmax function into regions where its gradients become tiny and training stalls. His solution was elegant. Divide each dot product by the square root of the dimension. This scaling factor keeps the numbers in a well-behaved range. It sounds like a minor detail, but without it, the mechanism would not train reliably. This is scaled dot-product attention.
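Here is the whole mechanism as a minimal numpy sketch, scaling included. The toy dimensions and random projection matrices are made up for illustration; in a real model, the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # Shazeer's scaling keeps scores well-behaved
    weights = softmax(scores)       # each row is a probability distribution
    return weights @ V              # weighted blend of the values

# Toy example: 5 tokens, 8-dimensional vectors. Queries, keys, and values
# all come from learned projections of the same token embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (5, 8): one new, context-aware representation per token
```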
The library analogy breaks down in two important ways. First, in a real library, the books exist independently of the question. In a transformer, the queries, keys, and values are all produced by the same learned transformations applied to the same input. The model is simultaneously writing the questions and the books. Second, the "relevance" is not semantic in the way humans would define it. It is numerical similarity in a learned vector space, which often corresponds to semantic relevance but does not have to.
One attention pattern is not enough. Consider the sentence "The lawyer who represented the defendant filed a motion." The word "filed" needs to attend to "lawyer" for its subject, but also to "motion" for its object, and perhaps to "defendant" for contextual meaning. A single attention head collapses all of these relationships into one weighted blend.
Multi-head attention runs several attention mechanisms in parallel, each with its own learned query, key, and value projections. The original transformer used eight heads. Each head can specialize. Research on what different heads learn has found that some focus on local context, nearby words and their relationships. Others track syntactic structure, connecting verbs to their subjects even when separated by many words. Others capture semantic relationships, abstract meaning patterns that are harder to categorize. Some heads focus on positional patterns. An important caveat from the research is that heads often perform multiple roles and their behavior is context-dependent. The specialization is a tendency, not a fixed assignment.
The outputs of all heads are concatenated and projected through one more learned transformation. The result captures multiple types of relationship simultaneously. This is what gives the transformer its representational power. One head might figure out that "it" refers to "the cat." Another might be tracking the sentence's grammatical structure. A third might be noting the emotional tone. The combined output is richer than any single attention pass could produce.
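A sketch of the multi-head version follows the same pattern, just repeated in parallel and stitched together at the end. The sizes here are shrunken toys; the original base model used eight heads of sixty four dimensions each over a model width of five hundred twelve.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, heads, W_q, W_k, W_v, W_o):
    # Each head gets its own projections, so each learns its own pattern.
    outputs = [attention(X @ W_q[h], X @ W_k[h], X @ W_v[h])
               for h in range(heads)]
    # Concatenate the heads, then mix them with one final projection.
    return np.concatenate(outputs, axis=-1) @ W_o

# Toy sizes: 5 tokens, model width 16, 8 heads of width 2.
rng = np.random.default_rng(0)
n, d_model, heads = 5, 16, 8
d_head = d_model // heads
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))
print(multi_head_attention(X, heads, W_q, W_k, W_v, W_o).shape)  # (5, 16)
```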
The transformer did not just offer a new approach. It actively solved problems that had plagued recurrent networks for decades.
The deepest problem was the vanishing gradient. When you train a recurrent network, error signals must propagate backward through every time step. If the sequence is a hundred tokens long, the gradient must pass through a hundred multiplicative steps. If the weights are small, the gradient shrinks exponentially at each step. By the time it reaches the early tokens, the learning signal has effectively vanished. The first words in a sentence barely get adjusted during training.
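The arithmetic is brutal even in a mild case. Suppose each backward step only scales the gradient by zero point nine:

```python
# A learning signal repeatedly multiplied by a factor below 1.0
# effectively vanishes long before it reaches the start of the sequence.
factor, steps = 0.9, 100
print(f"{factor ** steps:.2e}")  # ~2.66e-05: the early tokens barely learn
```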
Long Short-Term Memory networks, invented by Sepp Hochreiter and Jürgen Schmidhuber in nineteen ninety seven, partially solved this with gating mechanisms that controlled information flow. They were a genuine breakthrough that dominated sequence processing for twenty years. But they were a patch on a fundamental limitation. Very long sequences still degraded.
Self-attention connects any two positions with a single operation. Token one and token five hundred are equally close, computationally speaking. There is no chain of intermediate steps for the signal to degrade across. The path length between any two tokens is one. This is why transformers handle long text so much better than recurrent networks ever did.
The second advantage was parallelization. A recurrent network must process token two before it can process token three, because token three's computation depends on the hidden state produced by token two. You cannot skip ahead. This inherent sequentiality made recurrent networks slow to train, because modern hardware, particularly GPUs, excels at doing thousands of things simultaneously. The transformer processes all tokens at once. Every attention computation for every token happens in parallel. Shazeer articulated this directly.
Arithmetic is cheap and moving data is expensive on today's hardware. Transformers can solve those problems because you process the entire sequence simultaneously.
The third advantage only became clear later. Transformers scale. When you double the parameters, double the data, and double the compute, the performance improves in predictable, consistent ways. This is the scaling law behavior that we explore in episode seven. Recurrent networks did not scale as cleanly. They hit walls. Transformers kept climbing.
Attention has a cost, and it is a steep one.
Because every token attends to every other token, the number of attention computations grows with the square of the sequence length. If you have five hundred twelve tokens, the model computes roughly two hundred sixty two thousand attention scores. Double the sequence to one thousand twenty four tokens and you do not get double the computation. You get four times as much, roughly a million scores. At a hundred twenty eight thousand tokens, you are computing over sixteen billion scores. This quadratic scaling in both computation and memory is the fundamental bottleneck for processing long sequences.
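The arithmetic behind those figures takes three lines:

```python
# Attention scores grow with the square of the sequence length.
for n in (512, 1024, 128_000):
    print(f"{n:>7} tokens -> {n * n:>18,} scores")
# 512 -> 262,144   1024 -> 1,048,576   128,000 -> 16,384,000,000
```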
This is why early transformers were limited to a few hundred tokens. This is why context window expansion has been one of the most intensely researched problems in AI. And this is why a PhD student at Stanford named Tri Dao became one of the most important people in modern AI, despite never publishing a paper about a new architecture.
Tri Dao was going to study economics. During his first week at Stanford as an undergraduate, he took a few math classes and immediately switched to mathematics. He ended up in the Hazy Research lab doing a PhD in computer science, and the question he asked was not about the math of attention. It was about the memory.
The standard way of computing attention writes enormous intermediate matrices to the GPU's main memory. A GPU has two kinds of memory. A tiny fast cache called SRAM, about twenty megabytes on an A one hundred, which can move data at nineteen terabytes per second. And a large slow pool called HBM, forty to eighty gigabytes, which runs at about one and a half terabytes per second. That is roughly a twelve-to-one speed difference. Standard attention constantly shuffles the huge attention score matrix between the fast cache and the slow pool. The computation itself is fast. The memory traffic is the bottleneck.
Dao's insight was to fuse multiple operations into a single computation kernel that keeps everything in the fast cache. The attention score matrix, the full sequence-length-by-sequence-length matrix, never materializes in its entirety. The algorithm processes it in small blocks, computing the needed results and discarding the intermediate values before they ever leave the fast memory.
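Here is a toy numpy model of the idea for a single query: process the keys and values in blocks, keep a running maximum and running sum so the softmax stays exact, and never hold the full score vector at once. The real FlashAttention is a fused GPU kernel operating on whole matrices in SRAM; this is only the arithmetic skeleton.

```python
import numpy as np

def online_attention(q, K, V, block=4):
    """One query's attention, computed block by block. The full score
    vector is never materialized, yet the result is exact."""
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[-1])
    for i in range(0, len(K), block):
        s = q @ K[i:i + block].T / np.sqrt(d)  # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale what we accumulated
        p = np.exp(s - m_new)
        acc = acc * scale + p @ V[i:i + block]
        l = l * scale + p.sum()
        m = m_new
    return acc / l

# Check against ordinary full-matrix softmax attention.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
scores = q @ K.T / np.sqrt(8)
ref = np.exp(scores - scores.max())
ref = (ref / ref.sum()) @ V
print(np.allclose(online_attention(q, K, V), ref))  # True
```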
The goal is you want to scale to longer sequences, but scaling to longer sequences is difficult because attention scales quadratically.
He called it FlashAttention. It was published in May twenty twenty two. The result was two to four times faster training with exact attention, no approximations, and memory usage that grew linearly instead of quadratically with sequence length. Previous attempts to solve the quadratic problem had used approximations, computing only a subset of the attention scores. Dao showed you did not need to approximate. You needed to be smarter about where you stored the numbers.
Even though they perform fewer computations, they tend not to be faster in wall-clock time.
That was Dao describing previous approximate methods, a polite demolition of years of competing research. His second version, FlashAttention-2, published in July twenty twenty three, was twice as fast again and approached the theoretical efficiency of pure matrix multiplication. He described it as probably the most optimized subroutine on the planet. Nearly every major open model now uses it. LLaMA, Falcon, MPT, RedPajama, and most others.
The insight was interdisciplinary. Kernel fusion, the technique of combining multiple operations into one GPU pass, was a well-established concept in systems engineering. Online softmax, the trick that allows blockwise attention computation, came from the machine learning side. Dao combined them. When asked about the work, he noted the hardware dependency with unusual honesty for someone whose work has been so widely adopted.
FlashAttention is optimized specifically for NVIDIA GPUs. If the hardware changes, the optimization changes. This is the hardware lottery, the uncomfortable fact that which algorithms succeed depends not just on their mathematical elegance but on which hardware happens to be dominant. Dao thinks about this, by his own admission, quite a bit. Attention's quadratic cost remains a theoretical problem. FlashAttention makes it a practical non-issue at current scales, but only because it is exquisitely tuned to one company's silicon.
There is something attention cannot do on its own. It cannot tell where anything is.
Because every token attends to every other token simultaneously, the mechanism has no inherent sense of order. "The cat sat on the mat" and "The mat sat on the cat" would produce identical attention patterns. Word order carries meaning, and attention is blind to it. Position must be explicitly injected.
The original transformer used sinusoidal positional encoding. Each position in the sequence was represented by a unique pattern of sine and cosine waves at different frequencies, like giving each seat in a theater its own musical chord. Lower frequencies encoded coarse position, higher frequencies encoded fine distinctions. The system theoretically generalized to longer sequences than the model had seen during training. In practice, that extrapolation was limited.
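The encoding itself is only a few lines. This sketch follows the formula from the paper, with small toy dimensions:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Each position gets sines and cosines at geometrically spaced
    # frequencies: slow waves for coarse position, fast ones for fine.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to the token embeddings before the first layer

print(sinusoidal_positions(4, 8).round(2))  # four positions, four "chords"
```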
The solution that most modern models use was invented by Jianlin Su, who developed it in a series of Chinese blog posts in early twenty twenty one before formalizing it in a paper. His approach, Rotary Positional Embeddings or RoPE, rotates the query and key vectors by an angle proportional to their position. When the model computes the dot product between a query and a key, the rotation means the result naturally encodes the relative distance between the two tokens. It is mathematically elegant, it preserves the geometry of the vectors, and it has become the standard. LLaMA, Mistral, Qwen, and most other modern large language models use RoPE.
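A simplified single-vector sketch of the rotation, plus a check of the property that makes it work: after rotation, dot products depend only on relative distance. Real implementations vectorize this across whole sequences and heads, and some pair dimensions differently, but the geometry is the same.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) pair of dimensions by an angle that grows
    # with position. One frequency per pair, as in sinusoidal encoding.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: a query at position 5 against a key at
# position 9 scores the same as positions 105 against 109.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
print(np.isclose(rope(q, 5) @ rope(k, 9),
                 rope(q, 105) @ rope(k, 109)))  # True
```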
An alternative approach, ALiBi, took an even more radical stance. Instead of encoding position in the embeddings at all, it simply adds a linear penalty to attention scores based on how far apart the query and key are. Tokens that are far from each other get their attention scores reduced. The advantage is that you can train on short sequences and run inference on much longer ones without degradation. The paper was provocatively titled "Train Short, Test Long." It is used in BLOOM and some other models, though RoPE has become the community default.
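The core of ALiBi is almost embarrassingly small. A simplified sketch of the idea follows; note that the paper's causal version penalizes only distance into the past and assigns each head its own slope from a geometric series.

```python
import numpy as np

# ALiBi: subtract a slope times the query-key distance from the raw
# attention scores. No position embeddings anywhere in the model.
n, slope = 6, 1.0                 # one head's slope; the paper varies it
i, j = np.mgrid[0:n, 0:n]
bias = -slope * np.abs(i - j)     # farther apart -> bigger penalty
print(bias.astype(int))
# Then: scores = Q @ K.T / sqrt(d) + bias, and softmax as usual.
```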
In twenty twenty, a team at Google that included Jakob Uszkoreit published a paper titled "An Image is Worth Sixteen by Sixteen Words." They took a standard image, chopped it into a grid of small patches, sixteen by sixteen pixels each, treated each patch as if it were a word, and fed the sequence of patches into a standard transformer. No convolutional layers. No special image processing architecture. Just the same attention mechanism, applied to image fragments instead of text tokens.
It worked. The Vision Transformer, or ViT, achieved excellent results on image classification tasks, proving that the transformer architecture was not specific to language. It was a general-purpose pattern engine.
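The patching step is a reshape, nothing more. A minimal sketch with the paper's standard sizes, a two hundred twenty four pixel square image and sixteen-pixel patches:

```python
import numpy as np

def patchify(image, p=16):
    # Chop an image into non-overlapping p-by-p patches and flatten each
    # into a vector, producing a "sentence" of patch tokens.
    h, w, c = image.shape
    patches = image.reshape(h // p, p, w // p, p, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    # Each row is then linearly projected to an embedding, like a word.

image = np.zeros((224, 224, 3))
print(patchify(image).shape)  # (196, 768): 14x14 patches of 16*16*3 values
```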
After ViT, transformers invaded computer vision, audio processing, protein structure prediction with AlphaFold, drug design, robotics, and weather forecasting. The architecture that eight people built for machine translation turned out to be something closer to a universal computation pattern. That universality was not in the original paper. Nobody predicted it. The paper's ambitions were narrow, its implications were not.
The main episode sketched the afterlives of the eight authors. Here is the full picture, and it is one of the most remarkable diaspora stories in the history of technology.
Ashish Vaswani, born in Nagpur to an architect and a doctor, raised partly in Oman, co-founded Adept AI with Niki Parmar and David Luan in twenty twenty one. The company built action models for automating digital tasks and raised three hundred and fifty million dollars at a billion-dollar valuation. Vaswani and Parmar left Adept over reported differences with investors and co-founded Essential AI in twenty twenty three, raising fifty six point five million in a Series A from March Capital, Nvidia, and Google. They are building enterprise AI systems. Parmar later joined Anthropic in December twenty twenty four, where she works on frontier capabilities and reinforcement learning research.
Today is as good a day as any to share that I joined Anthropic last December. Claude three point seven is a remarkable model at complex tasks, especially coding, and I am thrilled to have contributed to its development.
Noam Shazeer left Google in twenty twenty one out of frustration. He and Daniel de Freitas had built Meena, a chatbot praised for sophisticated dialogue, then watched it get renamed LaMDA as executives refused to release it, citing safety and fairness principles. Twice the team sought permission to deploy, and twice leadership said no. Shazeer and de Freitas founded Character dot AI, a platform where users chat with AI versions of celebrities and custom personas. It raised over a hundred and fifty million dollars. Then the lawsuits arrived. A Florida family alleged the platform contributed to a fourteen-year-old's suicide. Texas families filed similar claims. In August twenty twenty four, Google signed a two point seven billion dollar agreement for a non-exclusive license to Character dot AI's technology, and Shazeer returned to Google as technical lead on Gemini. The Department of Justice is investigating whether the deal was structured to circumvent regulatory oversight.
Jakob Uszkoreit, the man who proposed the foundational hypothesis, experienced three events in quick succession in late twenty twenty that changed his direction. His daughter was born during the COVID lockdown. AlphaFold two demonstrated that deep learning methods he had co-invented could predict protein structures. And the first mRNA vaccine efficacy results arrived. He left his dream job at Google after fifteen years and co-founded Inceptive with Stanford's Rhiju Das, using transformers to design novel RNA sequences for vaccines and therapies. The company raised a hundred million dollars in twenty twenty four from Andreessen Horowitz and Nvidia, tripling its valuation to over three hundred million.
It is really about spending the right amount of effort and ultimately energy on a given problem.
Aidan Gomez, the twenty-year-old intern, finished his undergraduate degree at Toronto, completed a PhD at Oxford in absentia while simultaneously building Cohere, and turned the company into an enterprise AI powerhouse. Cohere focuses on retrieval augmented generation, private deployment, and multilingual support across more than a hundred languages. Revenue grew from thirteen million dollars annually at the end of twenty twenty three to seventy million in January twenty twenty five, with a target of two hundred million by year's end. An August twenty twenty five funding round valued the company at six point eight billion dollars. Geoffrey Hinton's venture fund, Radical Ventures, led the early investment.
I think the world needs something better than the transformer. I think all of us here hope it gets succeeded by something that will carry us to a new plateau of performance.
Llion Jones, the last of the eight to leave Google, departed in August twenty twenty three after more than a decade at the company, most recently at Google Research in Tokyo. His stated reason was bureaucracy.
It is just a side effect of big company-itis. I think the bureaucracy had built to the point where I just felt like I could not get anything done.
He co-founded Sakana AI with David Ha, former head of research at Stability AI. The name means "fish" in Japanese, inspired by the collective intelligence of schooling fish. Instead of training enormous models from scratch, Sakana merges and refines existing ones through evolutionary algorithms. The company raised three hundred and seventy nine million dollars in total funding and reached a valuation of two point six five billion dollars in November twenty twenty five, making it Japan's most valuable unicorn startup.
Łukasz Kaiser took a different path. He was the only one of the eight to join another major research lab rather than founding a company. He went to OpenAI in June twenty twenty one, where he co-invented the reasoning models behind the o one and o three series, served as research lead for o one, and contributed to ChatGPT, GPT-4, and GPT-5. A man who started in automata theory and logic in Paris became one of the architects of the most commercially successful AI products ever built.
Illia Polosukhin, who had built that first transformer prototype after a brainstorming lunch, left Google in twenty seventeen, the same year the paper was published. He co-founded NEAR dot ai with Alexander Skidanov, initially exploring AI and program synthesis, but pivoted to blockchain after encountering capital constraints. NEAR Protocol launched in twenty twenty and raised over five hundred and fifty million dollars. Polosukhin became CEO of the NEAR Foundation and a vocal advocate for user-owned AI that keeps data encrypted and runs privately.
After the Russian invasion of Ukraine, Polosukhin founded the Unchain Fund to support conflict victims through blockchain-enabled humanitarian aid, raising nearly ten million dollars.
The combined startup funding raised by the eight authors exceeds three billion dollars. The combined peak valuations approach thirteen billion. When Jensen Huang presented those framed NVIDIA cover plates at GTC in twenty twenty four, with over two thousand people in the audience, seven of the eight were present.
Everything that we are enjoying today can be traced back to that moment.
The transformer's dominance raises a question the authors themselves are now asking. When one architecture absorbs an entire field, what gets lost?
Llion Jones's complaint is not just aesthetic. Before the transformer, the machine learning community explored a wide variety of architectures for different problems. Recurrent networks for sequences. Convolutional networks for images. Reservoir computing for time series. Memory networks for reasoning. After twenty seventeen, almost everything became a transformer. The architecture was so good, and scaled so well, that exploring alternatives felt like a waste of resources. Why build something new when making the transformer bigger keeps producing improvements?
That North Star, it was there on day zero, and so it has been really exciting and gratifying to watch that come to fruition.
Now we are just waiting for the fusion.
That exchange happened at the GTC panel. Shazeer had compared the transition from recurrent networks to transformers to the transition from steam engines to internal combustion. Polosukhin's response suggested the transformer itself might be a transitional technology. State space models like Mamba have shown promise as sub-quadratic alternatives. New hybrid architectures combine attention with other mechanisms. The field is beginning to explore what comes after.
But for now, the attention mechanism that Bahdanau sketched in Montreal, that Uszkoreit proposed as a complete replacement for recurrence, that Shazeer refined into multi-head scaled dot-product form, that all eight of them sprinted to get published before a NeurIPS deadline, remains the load-bearing structure of the entire AI industry. When you send a message to any major AI system, the words are tokenized, embedded into vectors as we covered in episodes one and six, and then those vectors pass through dozens of attention layers. Each layer lets every token look at every other token and decide what matters. The representations get richer with each pass, building up from raw pattern matching toward something that, by the output layer, produces coherent, contextually appropriate language.
Whether that process constitutes understanding is a question for a different episode. What is not in question is the engineering. Eight people, a lunch conversation, a hallway encounter, a Beatles joke, and a three-month sprint produced the architecture that the world runs on.
That was the deep dive for episode four.
This episode's term: Transformer.
The version you would text a friend: It is the blueprint for how every major AI reads and writes. A specific design based on the attention mechanism. When someone says "a large language model," they almost certainly mean "a very large transformer."
How marketing uses it: As a magic word. "Transformer-based AI" appears in pitch decks and product pages the way "quantum" appears in science fiction. It sounds technical and impressive. It is usually there to signal sophistication without explaining anything.
What it actually means in practice: A transformer is a specific architecture with specific components. Attention layers, feed-forward layers, residual connections, and layer normalization, stacked dozens or hundreds of times. Input goes in as embeddings, passes through all the layers, and comes out as a probability distribution over possible next tokens. The name was coined by Jakob Uszkoreit in twenty seventeen because he liked how the word sounded and was thinking about the cartoon robots. Before the paper, the word meant an electrical device that changes voltage. Now it means the engine underneath practically every AI you have ever used. The earlier meaning was more precise, and honestly a little more useful if you are an electrician. But that ship has sailed.
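To make that concrete, here is a minimal single-head sketch of one block in numpy, with random untrained weights and the post-norm layer ordering of the original paper. Modern models usually normalize before each sublayer instead, and use many heads; this is the skeleton, not a recipe.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[-1])
    return softmax(scores) @ (X @ Wv)

def transformer_block(X, p):
    # Attention with a residual connection and normalization, then a
    # feed-forward layer with another. Stack this dozens of times.
    X = layer_norm(X + attention(X, p["Wq"], p["Wk"], p["Wv"]) @ p["Wo"])
    X = layer_norm(X + np.maximum(0, X @ p["W1"]) @ p["W2"])  # ReLU FFN
    return X

rng = np.random.default_rng(0)
d, n = 16, 5
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, 4 * d), "W2": (4 * d, d)}
p = {k: rng.normal(size=s) * 0.1 for k, s in shapes.items()}
X = rng.normal(size=(n, d))   # token embeddings go in...
for _ in range(3):
    X = transformer_block(X, p)
print(X.shape)  # ...richer representations come out, layer after layer
```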