At NeurIPS two thousand seventeen, a graduate student approached a young engineer at the poster session. The student was visibly emotional. He told the engineer that he had been stuck on his research for three years. Three years of fighting his framework instead of working on his ideas. Three years of debugging computation graphs that existed in a parallel universe from his actual code. Then he had switched to a new framework, and within three months he had made enough progress to graduate.
The engineer was Soumith Chintala. The framework was PyTorch. And the story that graduate student told, of years of frustration dissolving in months of productive work, was not unique. It was happening in labs across the world, one researcher at a time, one conversion at a time. By the end of two thousand seventeen, a quiet revolution was underway in machine learning. Not a revolution of algorithms or architectures, but of tools. The framework that Google had built to serve billions was losing the people who invented the ideas worth serving. And the framework that was winning them over had been built by a small team at Facebook, led by a developer from Hyderabad who had been rejected by twenty-five universities.
This is episode twenty of What Did I Just Install. And this is the story of how usability defeated power, how a Lua framework became a Python framework, how Facebook's AI lab built the tool that an entire field adopted as its own, and how the person who built it eventually walked away.
If you have ever typed pip install torch, you have installed a library that does one thing at a staggering scale. It performs mathematical operations on multi-dimensional arrays, called tensors, and it automatically computes the gradients of those operations so that a neural network can learn from its mistakes.
That sounds simple. It is not simple. Training a neural network means feeding data through layers of mathematical transformations, measuring how wrong the output is, and then flowing backward through every transformation to adjust every parameter by a tiny amount. Forward pass, backward pass, repeat millions of times. The framework's job is to make this process fast enough to be practical, flexible enough to express any model architecture a researcher can imagine, and automatic enough that the researcher does not have to derive calculus by hand for every new idea.
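If you are following along in the show notes, here is what that loop looks like in PyTorch today. This is a minimal sketch with made-up data, not anyone's real training script, but the shape of it, forward, loss, backward, step, is the shape of essentially every model trained in the framework.

```python
import torch

# A deliberately tiny model: one linear layer, three inputs, one output.
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Fake data, just to make the loop runnable.
inputs = torch.randn(32, 3)
targets = torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()                 # clear gradients from the last step
    predictions = model(inputs)           # forward pass
    loss = loss_fn(predictions, targets)  # measure how wrong the output is
    loss.backward()                       # backward pass: compute all gradients
    optimizer.step()                      # nudge every parameter a tiny amount
```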
Before PyTorch, before TensorFlow, before any of these frameworks, researchers did derive the calculus by hand. They wrote the forward pass and the backward pass separately, checked them against each other, and prayed they had not made a sign error somewhere in a chain of partial derivatives. A single mistake produced silent numerical errors that could take weeks to track down. The framework revolution, which began with Theano in Montreal and continued through TensorFlow and PyTorch, was fundamentally about eliminating that manual labor.
But PyTorch did something that its predecessors did not. It made the framework feel like ordinary Python. You could write a neural network using for loops and if statements. You could print the value of a tensor at any point during execution and see actual numbers, not a description of a future computation. You could set a breakpoint with any Python debugger and step through your model line by line. You could change the architecture in the middle of training and see what happened. The computation graph, that blueprint of mathematical operations that every framework needs, was built on the fly as your code executed. Every run created a fresh graph. Every experiment could reshape the model without redefining anything.
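Here is that behavior in miniature, a sketch you can paste into any Python shell with PyTorch installed. Every line executes the moment it runs, and the graph is recorded behind the scenes.

```python
import torch

x = torch.randn(4, 4, requires_grad=True)
h = x @ x.T              # runs immediately, like any Python expression
print(h)                 # real numbers, right now, mid-computation
h = torch.relu(h)        # the graph is being recorded as a side effect
loss = h.mean()
loss.backward()          # walk the freshly built graph backward
print(x.grad)            # gradients, as ordinary tensor values
```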
This sounds like a small difference. It was not a small difference. It was the difference between fighting your tool and using your tool. And to understand why that difference mattered so much, you have to understand what came before.
The story of PyTorch begins not in Python but in Lua, an unlikely programming language that most developers have never heard of. In two thousand two, a group of researchers began building a scientific computing framework called Torch. The name was exactly what it suggested, a tool to illuminate the dark landscape of machine learning computation. Over the following decade, Torch evolved through several versions, with the most significant being Torch7, published around two thousand eleven by Ronan Collobert at the Idiap Research Institute in Switzerland, Koray Kavukcuoglu at NEC Laboratories in Princeton, and Clement Farabet at New York University.
Torch7 was a genuinely good framework. It was fast, it was flexible, and it had excellent support for GPU computing at a time when most researchers were still training neural networks on CPUs. But it was written in Lua, and this was both its secret weapon and its fatal flaw.
Lua is a lightweight scripting language created in Brazil in nineteen ninety-three. It is elegant, fast, and beautifully designed. It has a tiny footprint and interfaces with C as naturally as breathing. Game developers love it. World of Warcraft uses it. Adobe Lightroom uses it. But the machine learning community does not love it, because the machine learning community lives in Python. Their data processing tools are in Python. Their visualization libraries are in Python. Their colleagues write Python. Their students learn Python. Every piece of infrastructure they touch speaks Python.
Torch7 chose Lua for technical reasons that were entirely sound. LuaJIT, the just-in-time compiler for Lua, was faster than any Python interpreter. The interface between Lua and C was cleaner. The language itself was simpler and more predictable. If you were building a numerical computing framework from scratch and cared only about performance and elegance, Lua was arguably the better choice.
But frameworks do not succeed on technical merit alone. They succeed on ecosystem. They succeed on the number of people who already know the language, who already have tools that work with it, who can copy a code snippet from a tutorial and run it without installing a new interpreter. And in this measure, Lua could not compete with Python. The researchers who used Torch7 were a devoted minority, an exclusive club that produced extraordinary work but never grew beyond a few thousand members.
One of those members was a graduate student from India who had arrived at New York University with nothing but persistence and a willingness to contribute code that nobody asked him to write.
Soumith Chintala grew up in Hyderabad, India, attending Hyderabad Public School. He was, by his own account, bad at math. This is a remarkable detail for someone who would go on to build the most widely used mathematical computing framework in machine learning. He was not the prodigy who won olympiads. He was not the natural who breezed through examinations. He was the kid who was interested in computers and problem-solving but whose grades did not reflect the kind of promise that top universities look for.
He enrolled at the Vellore Institute of Technology, known as VIT, a tier-two engineering college in southern India. In the hierarchy of Indian technical education, where IIT admission is the gold standard, VIT was a solid school but not the kind of credential that opens doors at elite research labs. Chintala graduated with a degree in information technology and decided he wanted to study in the United States.
He applied to twelve American universities for graduate admission. Every single one rejected him.
Most people would have taken the hint. Chintala did not take the hint. He flew to the United States anyway, on a J-1 exchange visa, and landed at Carnegie Mellon University with no formal admission and no clear plan. He applied again, this time to fifteen universities. He was rejected from all of them except two. USC accepted him. And NYU, through late admissions, accepted him as well. He chose NYU.
This is where the story turns. At NYU, Chintala worked in the lab of Yann LeCun, one of the three researchers who had kept the faith in neural networks through the long winter when the rest of the field had moved on. LeCun was already a towering figure in deep learning. He would go on to win the Turing Award alongside Geoffrey Hinton and Yoshua Bengio, the trio recognized as the godfathers of the field. As we covered in the TensorFlow episode, these three, working in Toronto, Montreal, and New York, had sustained neural network research through years of skepticism and underfunding.
At NYU, Chintala discovered Torch7. The Lua-based framework was used heavily in LeCun's lab, which was called the Computational and Biological Learning Lab, and Chintala began contributing to it. Not because anyone told him to. Not because it was part of his coursework. He contributed because the framework needed work and he wanted to help. This impulse, the desire to make tools better for other people, would define his entire career.
He graduated in two thousand twelve and immediately ran into the wall that separates academic credentials from industry jobs. He was rejected by DeepMind. Not once, not twice, but three times. His first real position was as a test engineer at Amazon, a role that was about as far from deep learning research as you could get while still working at a technology company.
A former PhD mentor helped him find a position at a small startup called MuseAmi, where he built deep learning models for music and vision targeted at mobile devices. The work was not glamorous, but it was real machine learning, and more importantly, it kept him connected to the Torch7 community. He kept contributing code. He kept making the framework better. He became, in the language of open source, a maintainer.
And then, in December of two thousand thirteen, something happened that would change the trajectory of Chintala's career and, eventually, the entire field of machine learning.
On December ninth, two thousand thirteen, Mark Zuckerberg announced that Facebook was creating a new research laboratory dedicated to artificial intelligence. It would be called Facebook AI Research, or FAIR, and it would be led by Yann LeCun. The mission was ambitious even by Silicon Valley standards. Build an open research lab, inside a social media company, that would advance the state of the art in artificial intelligence through open publication and open-source software.
LeCun's appointment was a signal that the deep learning revolution had arrived at the doorstep of the technology giants. Google had already started Google Brain in two thousand eleven. Baidu had been investing in deep learning in China. But FAIR was different from these other corporate labs in a crucial way. LeCun insisted that FAIR would operate like an academic lab. Researchers would publish their work openly. They would release their code. They would collaborate with universities. The entire point was to be part of the broader research community, not to hoard secrets for competitive advantage.
This philosophy would prove essential to PyTorch's success. But first, FAIR needed people.
Chintala had been contributing to Torch7, the framework LeCun's own lab used, for years. He had the skills. He had the track record of open-source contributions. And he had something that many more credentialed candidates lacked, a deep, practical understanding of what researchers needed from their tools, forged by years of actually building and maintaining those tools.
In May of two thousand fourteen, Soumith Chintala joined Facebook AI Research. He was twenty-six years old. He had been rejected by twenty-five universities, turned away by DeepMind three times, and had spent two years in jobs that had nothing to do with his ambitions. Now he was sitting in one of the most well-funded AI research labs on the planet, surrounded by some of the best researchers in the world.
The early years of FAIR were absolutely magical. It was a small team, deeply collaborative, building state of the art AI in an open environment. We had the resources of one of the largest technology companies in the world and the freedom of an academic lab. It was the best of both worlds.
The team used Torch7 extensively. And Chintala, who had been maintaining the framework as a volunteer, was now in a position to shape its future with the backing of a major corporation. He was not the most famous person in the lab. He was not the person whose name appeared on the most-cited papers. He was the person who made the tools work, who fixed the bugs that blocked other people's research, who sat at the intersection of engineering and science and made sure the bridge between them held. This role, the toolmaker, the infrastructure builder, the person who makes other people productive, is rarely celebrated in academia and often undervalued in industry. At FAIR, it would prove to be the most consequential role of all.
But shaping Torch7's future meant confronting its fundamental limitation. The framework was excellent. The language it was written in was not the language the world spoke.
By two thousand fifteen, the landscape of deep learning frameworks was crowded and fractured. TensorFlow had arrived from Google in November of that year, backed by enormous resources and an aggressive open-source strategy. Theano was still the framework of choice in many academic labs, particularly in Montreal. Caffe, built at Berkeley, dominated computer vision. And Torch7 continued to serve its devoted community, powerful but isolated behind the Lua barrier.
Chintala and his colleagues at FAIR could see what was happening. TensorFlow was vacuuming up mindshare. Not because it was the best framework for research, it was not, as we covered extensively in the TensorFlow episode, but because it was in Python, it was from Google, and it had an enormous marketing push behind it. The researchers who adopted TensorFlow would train the next generation of students on TensorFlow. The students would write their papers in TensorFlow. The ecosystem would crystallize around TensorFlow, and everyone else would be left outside.
The question was not whether to move Torch to Python. The question was how.
And then, in early two thousand sixteen, the answer arrived in the form of an email from a Polish university student.
Adam Paszke was studying computer science and mathematics at the University of Warsaw when he reached out to Soumith Chintala looking for an internship. Paszke was young, technically brilliant, and obsessed with automatic differentiation, the mathematical technique at the heart of every deep learning framework. Chintala saw an opportunity.
Adam reached out to me looking for internships, and I asked him to come do an internship to build the next version of Torch with modern design.
Sam Gross, an engineer at Facebook who was between projects, joined full-time as well. And Zeming Lin, another contributor who had fallen in love with a Japanese framework called Chainer, rounded out the founding team.
Chainer deserves a moment here, because without it, PyTorch might not exist. Built by Seiya Tokui at Preferred Networks in Tokyo, Chainer was the first deep learning framework to implement what its creators called "define-by-run." In every other framework of the era, you first defined your computation graph, specifying every operation in advance, and then you ran data through it. Chainer flipped this. In Chainer, the graph was defined by running it. You wrote Python code that executed immediately, and the framework recorded what you did, building the graph as a side effect of your computation. You could use ordinary Python control flow. You could branch and loop inside your model. You could print values. You could debug.
Chainer proved that define-by-run was not just possible but practical. It was a revelation for the researchers who tried it. But Chainer remained relatively obscure outside Japan, partly because of its smaller community and partly because it lacked the massive corporate backing that would bring a framework to global scale. The idea, though, was too powerful to stay in one place.
There was another inspiration. A library called torch-autograd, written by Alex Wiltschko and Clement Farabet, had brought automatic differentiation to Lua Torch. And torch-autograd itself was directly inspired by a library called HIPS autograd, created by Matt Johnson, Dougal Maclaurin, David Duvenaud, and Ryan Adams at Harvard. This library, HIPS autograd, would later also inspire what became JAX. The intellectual lineage is tangled and beautiful. PyTorch and JAX share a common ancestor, a library written at Harvard for differentiating Python and NumPy code.
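The shared idea behind all of these libraries, Chainer, torch-autograd, HIPS autograd, and eventually PyTorch, fits in a few lines. What follows is a toy tape-based autograd of my own construction, not code from any of those projects: each operation runs immediately, returns a real value, and records how to push gradients back to its inputs.

```python
# A toy define-by-run autograd. Every operation executes immediately and
# remembers how to propagate gradients, so the "graph" is built as a
# side effect of simply running the code.

class Value:
    def __init__(self, data, parents=(), backward_fn=lambda g: ()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = backward_fn  # incoming grad -> grads for parents

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, g in zip(self.parents, self.backward_fn(grad)):
            parent.backward(g)

x = Value(3.0)
y = x * x + x          # ordinary Python; the tape records as it runs
y.backward()
print(y.data, x.grad)  # 12.0 and 7.0, since d/dx of x^2 + x at x=3 is 7
```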
The PyTorch team took all of these inspirations and made a series of deliberate design choices. They would take the C and CUDA backend of Torch7, the part that was fast and well-tested, and decouple it from Lua entirely. They would wrap it in a Python interface that felt native, not like a foreign function call. They would implement define-by-run automatic differentiation, so the computation graph would be dynamic, built on the fly, discarded after each backward pass. And they would make one choice that sounds obvious in retrospect but was radical at the time.
They would make the framework feel like Python.
Not "use Python as a scripting layer on top of a graph engine." Not "generate Python-like syntax that compiles to a different representation." They would make it so that writing a neural network in PyTorch felt exactly like writing any other Python program. The same debugger. The same profiler. The same mental model. No hidden compilation step. No separate graph language. Just Python.
The work was not done by Chintala, Paszke, and Gross alone. The team grew to include Francisco Massa, Adam Lerer, Gregory Chanan, Trevor Killeen, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Edward Yang, Zachary DeVito, and others. When the official PyTorch paper was published at NeurIPS in two thousand nineteen, it had twenty-one authors. Paszke was listed first, Chintala last, the traditional position of the senior author in machine learning publications. The paper's title captured the design philosophy in a single line. "PyTorch: An Imperative Style, High-Performance Deep Learning Library."
The name was as straightforward as the design philosophy. Take "Torch," the framework they were building on. Put "Py" in front of it. PyTorch. Python plus Torch. The name told you exactly what it was. No clever acronym. No mythological reference. No venture-capital-friendly abstraction. Just a descriptor, like naming your car "the fast red one."
Compare this to the names of its competitors. TensorFlow was a precise technical description, tensors flowing through a computation graph. Theano was named after an ancient Greek mathematician. Caffe was an acronym, Convolutional Architecture for Fast Feature Embedding. Keras came from the Greek word for horn, a reference to Homer's Odyssey. PyTorch said: this is the Torch you already know, but in Python.
The initial release came in September of two thousand sixteen, in alpha form. The public launch followed on January nineteenth, two thousand seventeen. The timing was significant. TensorFlow had been out for just over a year and had already established itself as the dominant framework, at least in terms of adoption numbers and corporate support. Keras, built by Francois Chollet and integrated into TensorFlow, had made Google's framework vastly more usable. Google was investing heavily in TensorFlow's ecosystem, its documentation, its community, and its hardware integration through the Tensor Processing Units.
PyTorch launched into this landscape with a small team, no hardware story, no massive documentation effort, and a user base that consisted largely of former Torch7 users who were excited to finally work in Python. It should not have won. By every conventional measure, corporate backing, resources, marketing, ecosystem size, TensorFlow should have maintained its dominance for decades.
But PyTorch had something that TensorFlow did not have. It had the right answer to a question that researchers cared about more than anything else.
The question was simple. When your model produces wrong results, how do you figure out why?
In TensorFlow one point x, the answer was: you cannot, at least not easily. Because TensorFlow used a static computation graph, the model was defined in one phase and executed in another. The graph was a blueprint. The data flowed through the blueprint inside a session object. The values of your tensors did not exist until the session ran. You could not print a tensor and see a number. You could not set a breakpoint inside the graph. You could not step through the computation with a debugger. When something went wrong, you got an error message referencing a graph node with an auto-generated name like "dense slash kernel colon zero," a name that you had never typed and that corresponded to nothing in your mental model of your own code.
In PyTorch, the answer was: the same way you debug any Python program. Print the tensor. See the numbers. Set a breakpoint. Step through the code. The graph did not exist separately from the code. The graph was the code, built as it executed, a faithful record of what actually happened rather than a plan of what should happen. When something went wrong, the error pointed to a line you wrote, in a file you recognized, with variable names you chose.
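For listeners who never lived through TensorFlow one point x, a side-by-side sketch makes the difference concrete. The TensorFlow half is shown as comments because the old one point x API no longer runs on a modern install; the PyTorch half runs as written.

```python
# TensorFlow 1.x style (historical; needs the old 1.x API to run):
#   x = tf.placeholder(tf.float32, [None, 3])
#   y = tf.layers.dense(x, 1)
#   print(y)       # prints a symbolic Tensor description, no numbers
#   with tf.Session() as sess:
#       sess.run(tf.global_variables_initializer())
#       result = sess.run(y, feed_dict={x: data})  # numbers exist only here

# PyTorch, same model:
import torch

layer = torch.nn.Linear(3, 1)
data = torch.randn(5, 3)
y = layer(data)
print(y)             # actual numbers, immediately, at any point you like
```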
The moment I switched to PyTorch, it felt like someone had turned the lights on. I could see what my model was doing. Not a description of what it would do, not a plan, not a graph node. I could see actual numbers, at every layer, at every step. It was like going from assembly language to Python. Three years of struggling with TensorFlow, and within three months of PyTorch I had my first real results.
This was not a theoretical advantage. It was a practical one that changed how fast researchers could work. In machine learning research, the cycle of hypothesis, experiment, result, and revised hypothesis runs hundreds of times per project. Every minute spent debugging the framework instead of the model is a minute of wasted research. Over a PhD, those minutes compound into months, and those months determine whether a student publishes or perishes.
The dynamic graph had another advantage that was subtler but equally important. In TensorFlow, the graph had to be fully specified before any computation could run. This meant that architectures with variable-length inputs, recursive structures, or conditional computation required elaborate workarounds. You had to express branching logic and loops using TensorFlow's own control flow operations, which were different from Python's control flow operations, and which generated different graph structures depending on the execution path. Writing a recursive neural network in TensorFlow was an exercise in frustration.
In PyTorch, you just wrote a for loop. If you wanted your model to behave differently depending on the input, you wrote an if statement. If you wanted recursion, you called the function recursively. The graph would record whatever happened, no matter how complex the control flow, because the graph was built by tracing the actual execution. The researcher's creativity was no longer constrained by the framework's graph language. It was constrained only by Python itself, which imposed no constraints at all.
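A sketch of what that freedom looks like in practice. This model is invented for illustration, but the pattern, plain Python control flow inside forward, is exactly how real PyTorch models are written.

```python
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)

    def forward(self, x):
        # Ordinary Python control flow, traced into the graph as it runs.
        for _ in range(x.shape[0] % 3 + 1):  # depth depends on the batch itself
            x = torch.relu(self.layer(x))
        if x.sum() > 0:                      # branch on a runtime value
            x = x * 2
        return x

net = DynamicNet()
out = net(torch.randn(4, 8))
out.mean().backward()   # gradients flow through whatever path actually ran
```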
The adoption curve was unlike anything the machine learning community had seen. It did not follow the usual pattern of a new framework slowly gaining users through documentation and tutorials and conference talks. PyTorch spread through laboratories like a rumor, one graduate student showing another, one lab hearing about it from another lab, one researcher trying it on a side project and never going back.
The mechanism was simple and organic. A doctoral student in one lab would try PyTorch for a side project. They would find that the code was shorter, the debugging was faster, and the iteration cycle was dramatically tighter. They would mention it to a labmate. The labmate would try it. Within weeks, the entire lab would be using PyTorch for new projects, while grudgingly maintaining their old TensorFlow code for legacy experiments. Within a semester, the incoming students would learn PyTorch from their seniors, never touching TensorFlow at all. The framework spread through the academic network the way ideas spread, through personal recommendation and direct experience, not through marketing campaigns or corporate evangelism.
Nobody told us to switch. There was no mandate from the department, no email from the advisor. Someone in the lab showed me a PyTorch notebook and I could read it. I could understand what was happening. The TensorFlow code for the same model was twice as long and I had never fully understood the session mechanics. Within a week I had rewritten my main experiment in PyTorch and the debugging process went from hours to minutes.
The numbers confirmed what the hallway conversations suggested. In two thousand eighteen, PyTorch was a minority framework at most major conferences. By two thousand nineteen, every major conference had a majority of papers implemented in PyTorch. Sixty-nine percent of papers at CVPR, the premier computer vision conference, used PyTorch. More than seventy-five percent at NAACL and ACL, the top natural language processing venues. More than fifty percent at ICLR and ICML, the leading machine learning conferences.
The most telling statistic was the migration pattern. Among researchers who had used TensorFlow in two thousand eighteen, fifty-five percent switched to PyTorch in two thousand nineteen. Among researchers who had used PyTorch in two thousand eighteen, eighty-five percent stayed with PyTorch. TensorFlow was not just failing to grow in research. It was actively losing the researchers it already had, and they were going to exactly one place.
Google saw what was happening. In two thousand nineteen, four years after the original TensorFlow release, they shipped TensorFlow two point zero. The flagship feature was eager execution by default. Operations would run immediately when called, just like in PyTorch. You could print values. You could use Python control flow. The sessions were gone. The placeholders were gone. The feed dictionaries were gone. A decorator called tf dot function let you opt back into graph compilation for performance.
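In code, the TensorFlow two point zero model looked like this, and the resemblance to PyTorch was hard to miss. A minimal sketch, assuming a stock TensorFlow two installation.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
print(x * 2)        # eager by default now: real numbers, immediately

@tf.function        # opt back into graph compilation where speed matters
def double(t):
    return t * 2

print(double(x))    # traced once into a graph, then run optimized
```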
It was, as we noted in the TensorFlow episode, an admission that the original design had been wrong for a significant portion of TensorFlow's users. But the admission came too late. The researchers had already moved. The papers were already being written in PyTorch. The students were already learning PyTorch. The curriculum had shifted. And in machine learning, where the research community drives the innovation pipeline, losing the researchers meant losing the future.
TensorFlow two point zero makes eager execution the default mode. We believe this will make TensorFlow more intuitive and easier to debug, while retaining the ability to leverage graph optimizations for production deployment.
The corporate language was careful and measured. But what it did not say was just as important as what it did say. It did not say "we were wrong." It did not say "PyTorch showed us a better way." It did not say "we designed our framework for Google's internal needs and assumed the research community would accept the complexity." The announcement framed eager execution as a TensorFlow innovation, not as a course correction inspired by a competitor. But everyone in the machine learning community knew exactly what had happened.
For several years after the migration, the machine learning world existed in a curious split. Research ran on PyTorch. Production ran on TensorFlow. The two communities barely overlapped.
This made a certain kind of sense. TensorFlow had spent years building production infrastructure. TensorFlow Serving could deploy models at scale. TensorFlow Lite ran on mobile phones. TensorFlow dot js ran in web browsers. Google's Tensor Processing Units, custom chips designed for machine learning, were optimized for TensorFlow. If you were training a model that needed to serve a billion users, TensorFlow had the tooling and the track record.
PyTorch, by contrast, had focused relentlessly on the researcher experience and had largely ignored production deployment. If you trained a model in PyTorch and wanted to deploy it, you were on your own. You could export it to ONNX, an interchange format that was theoretically compatible with other frameworks and runtimes, but the conversion was lossy and unreliable. You could use TorchScript, an attempt to compile PyTorch models into a portable representation, but it imposed restrictions on the Python features you could use, which defeated much of the point of using PyTorch in the first place.
The result was a bizarre workflow that persisted across much of the industry for years. Researchers developed models in PyTorch because PyTorch was faster to iterate with. Then they rewrote the models in TensorFlow for production deployment, because TensorFlow had the serving infrastructure. The rewrite was tedious, error-prone, and often introduced subtle bugs. But nobody had a better option.
The absurdity of this situation became clearer with every passing year. Companies were paying engineers to translate working models from one framework to another, a process that added weeks to every deployment pipeline and created a permanent class of "framework translation" bugs that existed nowhere in the original research code. Startups that began life in PyTorch hired TensorFlow specialists solely to handle deployment. Universities that taught PyTorch in their machine learning courses had to teach TensorFlow in their production engineering courses. The two communities read different documentation, attended different meetups, and thought about computation in fundamentally different ways.
But the split contained the seeds of its own resolution. As more and more of the research world moved to PyTorch, the pressure to deploy PyTorch models in production became unbearable. Companies that had invested heavily in TensorFlow infrastructure began building bridges. Amazon Web Services developed TorchServe, a model serving framework specifically for PyTorch models. Microsoft added PyTorch support to its Azure Machine Learning platform. Even NVIDIA, which had historically been framework-agnostic in its public positioning, began optimizing its tooling increasingly around PyTorch workflows.
The research community was too important to ignore. The best ideas came from PyTorch code. The best students learned PyTorch. And companies needed those ideas and those students. The framework that researchers chose was, inevitably, the framework that industry would have to support.
This split was not sustainable. And the cracks began to show not from the outside but from inside Google itself.
In two thousand eighteen, a small team at Google Brain quietly released a library that would complicate the framework landscape in ways that nobody anticipated. The library was called JAX, and it took a fundamentally different approach from both TensorFlow and PyTorch.
JAX was not, strictly speaking, a deep learning framework. It was a system for composable function transformations on Python and NumPy code. You wrote ordinary Python functions. JAX could automatically differentiate them. JAX could automatically vectorize them. JAX could compile them to run on GPUs and TPUs through Google's XLA compiler. And it did all of this through a functional programming paradigm, where functions were pure, state was explicit, and transformations composed cleanly.
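A small sketch of that composability, using JAX's three signature transformations. The function and shapes here are invented for illustration.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)   # a pure function: no hidden state

grad_fn = jax.grad(loss)                      # differentiate it
fast_grad = jax.jit(grad_fn)                  # compile it through XLA
batched = jax.vmap(loss, in_axes=(None, 0))   # vectorize it over a batch

w = jnp.ones(3)
xs = jnp.ones((4, 3))
print(fast_grad(w, xs[0]))   # gradient of the loss with respect to w
print(batched(w, xs))        # one loss value per batch element
```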
The creators of JAX, a team that included Roy Frostig, Matt Johnson, Dougal Maclaurin, and Chris Leary, were building on the same intellectual lineage that had influenced PyTorch. Matt Johnson and Dougal Maclaurin were among the creators of the HIPS autograd library at Harvard, the same library that had inspired torch-autograd, which had inspired PyTorch's automatic differentiation. The family tree was incestuous in the way that all fundamental research is incestuous. Good ideas migrate between groups, accumulate improvements, and emerge in new contexts wearing new names.
JAX appealed to a different kind of researcher than PyTorch. Where PyTorch felt like Python, JAX felt like mathematics. Where PyTorch let you write imperative code with mutable state, JAX insisted on pure functions and immutable data. Where PyTorch built a graph dynamically by tracing your code, JAX transformed your functions through explicit, composable operations. The trade-off was clear. JAX was harder to learn but more powerful for certain kinds of work, particularly large-scale distributed training and research that required precise control over parallelism and compilation.
Google began using JAX internally for its most important research. DeepMind, which Google had acquired in two thousand fourteen, adopted JAX for much of its work. The models that powered Gemini, Google's largest language model, were trained on JAX infrastructure. JAX was not trying to replace TensorFlow in the same way that PyTorch had. It was carving out a different niche entirely, one focused on the intersection of machine learning research and high-performance computing.
JAX is not really competing with PyTorch for the same users. It is more like a different philosophy of computation. PyTorch says, write Python and we will differentiate it. JAX says, write pure functions and we will transform them in every way you can imagine. Both are valid. They attract different minds.
And then came the irony that the entire machine learning community noticed but nobody quite knew how to process.
Adam Paszke, the University of Warsaw student who had co-created PyTorch's autograd system, who had been one of the first people Chintala recruited, whose mathematical brilliance was woven into the foundation of the framework, left the PyTorch project and went to Google, joining the team working on JAX.
There is no public drama to this story. No falling out, no corporate espionage, no bitter departure. Paszke simply found that JAX's approach to function transformations interested him more than maintaining PyTorch's imperative interface. He wanted to explore different ideas about what scientific computing could look like. He went to the place where those ideas were being explored.
But the symbolism was impossible to ignore. The co-creator of PyTorch, the person who built the autograd engine that made the framework possible, was now working on what many considered PyTorch's most serious intellectual competitor. At Google. For Google DeepMind, specifically, where he became a Senior Staff Research Scientist working on projects called Pallas and Mosaic, tools that extended JAX's capabilities for accelerator programming.
I was drawn to the functional approach. The idea that you can express computation as pure transformations, that you can compose those transformations in clean and predictable ways, felt like the right direction for scientific computing. It was a different philosophy from what we built in PyTorch, but both philosophies have their place.
This is a pattern we have seen before in this series. The creator moves on. TJ Holowaychuk left Express for Go. Salvatore Sanfilippo left Redis because maintenance killed creativity. Ryan Dahl gave Joyent everything and walked away. Solomon Hykes left Docker. The tools outlive their creators, and the creators move on to whatever interests them next. The framework is not the person. The person is not the framework. But in PyTorch's case, the framework was so successful that it could absorb the departure of a co-creator without missing a step.
By two thousand twenty-two, PyTorch had a problem that was, in some ways, the best problem a software project can have. It had become too important for any single company to own.
Meta, as Facebook had renamed itself, was still the largest contributor to PyTorch. But PyTorch was now used by virtually every major technology company, every major research university, and an ecosystem of startups that had built their entire technical stacks around it. Amazon Web Services, Google Cloud, Microsoft Azure, AMD, and NVIDIA had all made significant investments in PyTorch compatibility. If Meta had decided to change PyTorch's direction in a way that served Meta's interests but not the broader community's, the consequences would have rippled across the entire machine learning ecosystem.
On September twelfth, two thousand twenty-two, Meta announced that it was transferring PyTorch to the Linux Foundation, establishing the PyTorch Foundation as a new entity to govern the project's development. The founding members of the board included AMD, Amazon Web Services, Google Cloud, Meta, Microsoft Azure, and NVIDIA, a consortium that represented virtually the entire cloud computing and hardware industry.
The creation of the PyTorch Foundation ensures that business decisions are made in a transparent and open manner by a diverse group of members for years to come, while technical decisions remain in control of individual maintainers.
This move was significant for what it was and for what it was not. It was a genuine transfer of governance. Meta would remain the largest contributor but would not have unilateral control. The Foundation would ensure neutral branding, fair processes, and open development. But it was not a gift with no strings attached. Meta benefited enormously from PyTorch's dominance. Every researcher who learned PyTorch was a potential Meta employee who would be productive on day one. Every model trained in PyTorch was compatible with Meta's infrastructure. The transfer to a foundation made PyTorch more trustworthy for other companies, which meant those companies would invest more in PyTorch, which meant the ecosystem would grow, which meant Meta would benefit even more.
Compare this to the pattern we have seen with other corporate open-source transfers in this series. npm was sold to GitHub, which Microsoft had already acquired. Express was handed to StrongLoop, which was acquired by IBM. Redis was relicensed by the company built around it, triggering the Valkey fork. In each case, the transfer was driven by corporate needs, and the community was a secondary consideration. The PyTorch Foundation transfer was different because it happened proactively, before a crisis, and because the governance structure genuinely distributed power rather than concentrating it.
It was also, quietly, an acknowledgment that PyTorch had won. You do not transfer a losing project to a neutral foundation. You transfer the project that the entire industry depends on, because the dependency itself creates obligations that a single company should not bear alone.
And then the world noticed what the machine learning community had been building.
In November of two thousand twenty-two, one month after the PyTorch Foundation was established, OpenAI released ChatGPT. Within five days, it had a million users. Within two months, it had a hundred million. The general public, which had spent decades hearing about artificial intelligence as a distant promise, suddenly had a conversation partner that could write essays, debug code, explain quantum physics, and compose poetry. The AI winter was over, not gradually but all at once, in a blast of public attention that transformed the technology industry overnight.
ChatGPT ran on models trained with OpenAI's custom infrastructure, and OpenAI itself had standardized on PyTorch as its framework of choice back in two thousand twenty. The explosion ChatGPT triggered, the thousands of companies that raced to build their own AI products, the researchers who pivoted their entire careers toward large language models, the startups that raised billions of dollars on the promise of generative AI, all of that activity ran overwhelmingly on PyTorch.
When Meta released LLaMA in February of two thousand twenty-three, the open-source large language model that democratized access to powerful AI, it was written entirely in PyTorch. When Stability AI released Stable Diffusion, the image generation model that put AI art on every social media platform, it was built on PyTorch. When thousands of researchers and hobbyists fine-tuned these models, adapted them, improved them, and built products on top of them, they did it in PyTorch. The framework that had won the research community was now, by extension, the framework on which the entire generative AI revolution was being built.
The scale of this was staggering. In two thousand twenty-three alone, over twenty thousand research papers used PyTorch. More than one hundred forty thousand GitHub repositories were created with PyTorch code. The PyTorch Tools ecosystem grew by over twenty-five percent. Contributions came from more than three thousand five hundred individuals and three thousand organizations. The framework that had started as a small project by a handful of engineers at Facebook was now the substrate on which the most transformative technology of the decade was being developed.
And through all of this, the PyTorch Foundation's governance held. The diversity of its board, Amazon and Google and Microsoft and Meta and NVIDIA all sitting at the same table, meant that no single company's strategic interests could redirect the framework. When Meta wanted PyTorch to optimize for LLaMA-style models, it contributed those optimizations upstream. When NVIDIA wanted PyTorch to take advantage of new GPU architectures, it contributed the kernel implementations. When Amazon wanted better support for its SageMaker platform, it built the integrations. The Foundation structure turned potential conflicts into contributions.
For years, PyTorch's critics had a valid complaint. The framework was wonderful for research but mediocre for production. The dynamic graph, the very feature that made PyTorch so researcher-friendly, made it harder to optimize. A static graph, like TensorFlow's original design, could be analyzed, optimized, and compiled ahead of time. Operators could be fused. Memory could be preallocated. The entire computation could be transformed into a form that squeezed every last drop of performance from the hardware. A dynamic graph, by definition, could not be analyzed ahead of time because it did not exist ahead of time.
PyTorch two point zero, released in March of two thousand twenty-three, changed this with a single feature called torch dot compile.
The idea was elegant. You wrote your PyTorch code exactly as before, using the full dynamic, Pythonic interface. Then you added one line. One decorator. And PyTorch would trace your code, capture the computation graph it produced, and compile that graph into highly optimized machine code using a new backend called TorchInductor. The graph was still dynamic in the sense that it could change between runs. But within a single run, the captured graph could be optimized as aggressively as any static graph.
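In practice it looks like this. A minimal sketch, with an invented model; the one meaningful line is the call to torch dot compile.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

compiled = torch.compile(model)   # the one-line change

x = torch.randn(64, 128)
out = compiled(x)        # first call traces and compiles; later calls run fast
out.sum().backward()     # training code is otherwise unchanged
```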
The results were dramatic. Internal benchmarks across one hundred and sixty-three open-source models showed an average forty-three percent speedup during training with a single-line change to the code. The change worked in ninety-three percent of the projects tested.
This was PyTorch's answer to the performance question, and it was a characteristically PyTorch answer. Instead of forcing users to change how they wrote their code, PyTorch changed how it executed their code. The user experience stayed the same. The performance caught up. Usability first, optimization second. Always.
If PyTorch's victory over TensorFlow in research was the first act, and the Foundation transfer was the second, then the third act was the emergence of an entire ecosystem that assumed PyTorch was the only framework that mattered.
The clearest example is Hugging Face, the company and platform that became the GitHub of machine learning models. The Transformers library, Hugging Face's flagship product, supported both TensorFlow and PyTorch for years. But in two thousand twenty-five, with the release of Transformers version five, Hugging Face made the decision official. They dropped TensorFlow and Flax support entirely. PyTorch became the sole backend. The announcement was framed as a practical decision about engineering resources, but the subtext was unmistakable. The machine learning world had made its choice, and the choice was PyTorch.
The numbers tell the story of that choice. The Transformers library had grown from supporting forty model architectures in its version four release to over four hundred in version five. It had gone from roughly one thousand model checkpoints on the Hugging Face Hub to more than seven hundred and fifty thousand. Daily installations had grown from twenty thousand to more than three million. And all of this, every model, every checkpoint, every research paper that accompanied them, ran on PyTorch.
The broader ecosystem told the same story. LLaMA, as we have seen, was written in PyTorch. Stable Diffusion was built on PyTorch. And when researchers at universities around the world trained the models that pushed the boundaries of natural language processing, computer vision, and reinforcement learning, they trained them in PyTorch.
PyTorch had not just won the framework war. It had become the default assumption, the water in which the machine learning fish swam, so ubiquitous that using anything else required justification.
On November sixth, two thousand twenty-five, Soumith Chintala published a blog post that surprised the entire machine learning community.
I am stepping down from PyTorch and leaving Meta on November seventeenth. I did not want to be doing PyTorch forever, and it seemed like the perfect time to transition right after I got back from a long leave and the project had built itself around me. Eleven years at Meta. Nearly all my professional life. Making many friends for life. Almost eight years leading PyTorch, taking it from nothing to ninety percent plus adoption in AI.
The blog post was long, personal, and remarkably honest. Chintala wrote about not wanting to be like Guido van Rossum or Linus Torvalds, tied to the same project for decades. He wrote about needing to know what was out there, needing to do something small again. He could have moved to something else inside Meta, but he could not live with the counterfactual regret of never trying something outside the company that had been his entire professional home.
I could have moved to something else inside Meta. But I needed to know what is out there. I needed to do something small again. I could not live with the counterfactual regret of never trying something outside Meta.
He reflected on what PyTorch had become. It handled exascale training. It powered foundation models that were redefining intelligence. It was in production at virtually every major AI company. It was taught in classrooms from MIT to rural India. The most joyful moments, he wrote, were meeting users eager to share their happiness and feedback. The graduate student at NeurIPS two thousand seventeen who said three months of PyTorch accomplished what three years of another framework could not. The researcher who told him that PyTorch had made deep learning feel like programming again.
In January of two thousand twenty-six, Chintala was announced as the Chief Technology Officer of Thinking Machines Lab, the startup founded by Mira Murati, the former Chief Technology Officer of OpenAI. The kid from Hyderabad who had been rejected by twenty-five universities, who had worked as a test engineer at Amazon, who had spent years contributing code to a Lua framework that nobody outside a small community knew about, was now the CTO of one of the most closely watched AI companies in the world.
But Chintala was not the only departure from the old guard. Just five days after Chintala announced his departure from Meta, Yann LeCun, the man who had founded FAIR, who had hired Chintala, who had co-won the Turing Award for keeping the deep learning faith alive through the winter, announced that he too was leaving Meta. After twelve years, five as FAIR's founding director and seven as the company's chief AI scientist, LeCun was starting his own company. Advanced Machine Intelligence Labs, or AMI Labs, headquartered in Paris, raised over one billion dollars in its first funding round, the largest first round in European history, backed by NVIDIA, Eric Schmidt, and Jeff Bezos.
You certainly do not tell a researcher like me what to do.
The lab that LeCun had built, that Chintala had joined as one of its earliest members, that had produced PyTorch and dozens of other research breakthroughs, was losing its founders. The tools they built would remain. The ideas they championed would persist. But the people were moving on, drawn by the same restless energy that had brought them together in the first place.
If you type pip install torch on a machine in two thousand twenty-six, here is what you get. The torch package itself, a large download that includes precompiled binaries for your operating system and, if you have an NVIDIA GPU, bundled CUDA libraries. Its direct Python dependencies are modest. Filelock, for managing file access. Jinja2, which appeared in this series as the template engine maintained by Armin Ronacher, used here for generating kernel code. Typing-extensions, for Python type annotation support. Sympy, a symbolic mathematics library, used for shape inference and symbolic reasoning about tensor operations. NetworkX, a graph library, used internally for representing computation graphs. And MarkupSafe, a dependency of Jinja2 that ensures safe string handling.
That list looks small. It is deceptive. Because while PyTorch's direct Python dependencies are minimal, its true dependency tree extends into the hardware. PyTorch is deeply, fundamentally tied to NVIDIA's CUDA platform. Every GPU-accelerated operation in PyTorch runs through CUDA kernels. The bundled CUDA libraries in the pip package can weigh over two gigabytes. When you install PyTorch, you are not just installing a Python library. You are installing a bridge between Python and the specific hardware that NVIDIA designs and manufactures.
This is the unspoken dependency that nobody in the machine learning community likes to talk about. PyTorch supports AMD GPUs through the ROCm platform. It supports Apple silicon through Metal Performance Shaders. It supports Intel GPUs through their extension libraries. But the overwhelming majority of PyTorch users, the researchers, the companies, the cloud providers, run on NVIDIA hardware. When researchers write papers about training large language models, they measure performance in NVIDIA A100 or H100 hours. When companies budget for machine learning infrastructure, they budget for NVIDIA GPUs. PyTorch did not create this dependency on NVIDIA. But PyTorch is the layer through which most of the world's machine learning code reaches NVIDIA's hardware, and that makes it one of the most strategically important pieces of software in the AI ecosystem.
Now trace what depends on PyTorch. The Transformers library from Hugging Face. PyTorch Lightning, the training framework that adds structure and boilerplate reduction. torchvision, for computer vision. torchaudio, for audio processing. Every research repository that trains a neural network. Every startup that fine-tunes a language model. Every cloud platform that offers machine learning as a service. The dependency tree above PyTorch is not a tree. It is a forest, and it stretches from university labs to production systems serving billions of users.
And at the root of that forest, beneath the Python abstractions and the CUDA kernels and the automatic differentiation, is the C backend that Ronan Collobert and Koray Kavukcuoglu and Clement Farabet wrote for a Lua framework twenty years ago, refactored and extended beyond recognition but still there, still doing the fundamental work of multiplying tensors.
Zoom out far enough and the PyTorch story is not really about deep learning frameworks at all. It is a story about the same tension that runs through every episode of this series. The tension between power and usability.
Google built TensorFlow for power. Maximum scalability, maximum correctness, maximum capability at planetary scale. The assumption was that users would accept the complexity because the capability justified it. This assumption was correct for Google's own engineers, who were trained on Google's own tools and who worked on Google's own problems. It was catastrophically wrong for everyone else.
Facebook built PyTorch for usability. Make it feel like Python. Make debugging work. Make the researcher's mental model match the framework's actual behavior. The assumption was that users would choose the tool that got out of their way, even if that tool was initially less powerful for production deployment. This assumption was correct for the research community, and the research community turned out to be the kingmaker.
We have seen this pattern before. In episode two, Kenneth Reitz built Requests because Python's built-in HTTP library was powerful but hostile. In episode twelve, Sebastian Ramirez built FastAPI because the existing web frameworks were capable but required too much boilerplate. In episode six, Richard Hipp built SQLite because existing databases were powerful but required a server. In each case, the tool that prioritized the developer's experience won, even when the more powerful alternative had a head start, more resources, and deeper institutional support.
PyTorch is the largest-scale validation of this principle that the software world has seen. This was not a utility library or a web framework. This was the foundational tool for an entire scientific discipline, a tool that needed to perform at the absolute bleeding edge of computational performance. And even here, even at the scale where performance is life and death, usability won. The researchers chose the tool that felt like Python over the tool that felt like a configuration language. And then the industry followed the researchers.
The lesson is not that power does not matter. PyTorch is enormously powerful. Torch dot compile closed much of the performance gap with TensorFlow. The framework supports distributed training across thousands of GPUs. It handles exascale workloads for the largest AI models ever built. But the power came second, layered on top of a foundation that prioritized the human experience. TensorFlow tried to add usability on top of a power-first foundation, with Keras and later with eager execution. It worked, partially, but it was never quite as natural because the foundation was not designed for it.
The most joyful moments of building PyTorch were meeting users eager to share their happiness, love, and feedback. I do miss the intimacy of the PyTorch community, with a three hundred person conference that felt like an extended family gathering, but I feel that is a small price to pay considering the scale of impact PyTorch is truly having today.
So here is where the deep learning framework story stands, as of early two thousand twenty-six.
PyTorch dominates. Sixty percent of all machine learning papers with associated code use PyTorch. More than seventy million downloads per month on PyPI alone. Over twenty thousand research papers and one hundred forty thousand GitHub repositories in the past year. The Hugging Face ecosystem, the single most important platform for sharing and deploying machine learning models, is PyTorch-only as of Transformers version five. Every major foundation model, from Meta's LLaMA to Stability AI's Stable Diffusion to the open-source models that power the thousands of AI startups that have emerged in the past three years, is built on PyTorch.
TensorFlow persists, particularly in production systems that were built during its era of dominance and in mobile and edge deployment where TensorFlow Lite has no equal. Google still uses TensorFlow internally for some workloads, though JAX has increasingly taken over for research and for the largest models. TensorFlow is not dead, but it is no longer growing, and the ecosystem is slowly migrating away.
JAX occupies its own space, used primarily at Google, DeepMind, and a handful of research groups that prize its functional programming model. It is the framework for people who think in mathematics rather than code, and for workloads that require the kind of precise control over parallelism and compilation that JAX's transformation-based approach enables. It is brilliant, powerful, and will likely never have mass adoption, which is fine because mass adoption was never the point.
And Chainer, the Japanese framework that proved define-by-run was possible, that inspired PyTorch's core design, announced in December of two thousand nineteen that it would stop development and transition to PyTorch. The creators chose PyTorch because they believed it was the closest framework in spirit to what they had built. The child had absorbed the parent's philosophy, and the parent gracefully stepped aside.
The person who started it all is gone from the project. Soumith Chintala left PyTorch and Meta in November of two thousand twenty-five. Adam Paszke, who co-built the autograd system, works at Google DeepMind on JAX. Yann LeCun, who founded the lab where PyTorch was born, left Meta to start AMI Labs in Paris. The founding generation has dispersed, pursuing new ideas, new companies, new challenges.
But PyTorch remains. The framework handles exascale training. It powers the foundation models that are redefining what artificial intelligence can do. It is taught in classrooms from MIT to rural India. It is the tool through which most of the world's machine learning researchers express their ideas, and through which most of the world's AI systems are built.
And it all started because a student from Hyderabad, rejected by twenty-five universities, kept contributing code to a Lua framework that nobody asked him to improve, until the day he was in a position to build something better. Something that felt like Python. Something that got out of the researcher's way. Something that an entire field would adopt as its own.
That is the story of PyTorch. The tool that felt like Python, built by a team that listened to what researchers actually needed instead of telling them what they should want. And every time you type pip install torch, every time you import torch and create a tensor and watch the gradients flow backward through your model, you are standing on the work of that team, and the work of the Torch7 team before them, and the work of the Chainer team in Tokyo, and the work of the autograd team at Harvard, and the work of the Montreal lab that started it all with Theano.
The dependency chain of ideas is longer than any requirements dot txt.
That was episode twenty of What Did I Just Install.
pip install torch. Open a Python shell. Type import torch, then x equals torch dot randn three comma three, with requires grad set to true, then print x. You will see a three by three matrix of random numbers. That is a tensor. Now type y equals x times two, then y dot sum dot backward, then print x dot grad. You will see a matrix of twos, the derivative of two x with respect to x. You just computed a gradient. Automatically. No calculus, no manual derivatives, no graph compilation. A couple of lines of code. That is what Soumith Chintala and Adam Paszke wanted deep learning to feel like. Like Python.
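And for anyone reading the show notes rather than listening, that closing session as runnable code:

```python
import torch

x = torch.randn(3, 3, requires_grad=True)  # requires_grad tells autograd to watch x
print(x)             # a three by three matrix of random numbers: a tensor

y = x * 2
y.sum().backward()   # build the graph by running it, then walk it backward
print(x.grad)        # a three by three matrix of twos: the gradient of 2x is 2
```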