Airflow at Eleven Years: The DAG That Ate Data Engineering

The Tool One Engineer Built Because He Was Tired

Most of the data pipelines that run quietly in the background of the modern internet — the ones that move trade settlements at banks, that compute royalties at streaming services, that send you a marketing email twelve minutes after you abandoned a shopping cart — most of them run on top of a tool that one engineer started building in his spare time, in October of twenty fourteen, because his employer's existing pipeline tooling was making him miserable.

He worked at Airbnb. His name was Maxime Beauchemin. The tool he built was called Airflow, and over the next eleven years it would become the default way that the data engineering profession schedules work. This is the story of how that happened, why it almost did not happen, and what the project looks like now, in the spring of twenty twenty six, after the most ambitious release in its history.

Cron and Despair

To understand why Airflow mattered, you have to understand what existed before it. The answer is mostly cron. Cron is a Unix scheduler that has been around since nineteen seventy five. You give it a time and a command. It runs the command at that time. Cron does one thing. It does not understand that one job depends on another. It does not know if a script succeeded or failed. It does not care if a downstream task starts running while its upstream task is still in flight. Cron is a stopwatch that pulls triggers, and that is the entire job description.

For a small team with a handful of nightly jobs, cron is fine. But the moment your pipeline grows past a few dozen tasks, the cracks start showing. Somebody changes the schedule on the bookings extract. Now the join job, which used to start at three a.m., runs while the bookings are still being written. Half the rows are missing. The dashboards lie all morning. Nobody notices until the head of finance asks why the revenue number looks wrong.

The way most teams patched around this was with what people politely called workflow scripts. You wrote one big bash script that ran the jobs in order, with sleep statements between them, and added error handlers and email alerts and retry counters by hand. Every team had its own version of this script. Every script was slightly broken in its own personal way. This is the world Maxime Beauchemin walked into at Airbnb in twenty fourteen.

The Cambrian Explosion

Beauchemin had come from Facebook. He worked there from twenty twelve to twenty fourteen, and he has described that period as a Cambrian explosion of data tools. Facebook had thousands of engineers, and a great many of them were writing data infrastructure. Some of it became open source, the way Cassandra and Presto did, and some of it stayed internal. One of the internal tools was called Data Swarm. It was a pipeline scheduler that actually understood dependencies. If job B depended on job A, Data Swarm would not start B until A had finished. If A failed, B would not run at all.
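
That dependency rule is easy to sketch. What follows is a toy illustration in Python, not Data Swarm's code and not Airflow's. The function name run_pipeline, the status strings, and the jobs a, b, and c are all made up for the example. It runs jobs upstream first and skips anything whose upstream did not succeed.

```python
from graphlib import TopologicalSorter

def run_pipeline(graph, tasks):
    """Run tasks upstream-first; skip any task whose upstream did not succeed.

    graph maps task name -> set of upstream task names.
    tasks maps task name -> zero-argument callable; raising means failure.
    Returns a dict of task name -> "success", "failed", or "skipped".
    """
    status = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = graph.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"   # never started, because something upstream broke
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status

def broken():
    # Hypothetical job that always fails, to show the effect downstream.
    raise RuntimeError("extract failed")

graph = {"a": set(), "b": {"a"}, "c": {"b"}}
result = run_pipeline(graph, {"a": broken, "b": lambda: None, "c": lambda: None})
healthy = run_pipeline(graph, {"a": lambda: None, "b": lambda: None, "c": lambda: None})
```

With a failing, result marks b and c as skipped. Cron would have run both of them anyway.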

In twenty twenty four, by which point this counted as ancient history, Beauchemin sat down with a podcast called Data Renegades and described what happened next. When he interviewed at Airbnb, he made one condition.

If I join, could I work on building something like Data Swarm?

Airbnb said yes. He started in twenty fourteen, and within a few months, he had built the first version of the thing he wanted.

October the First

The first commit to the Airflow repository landed in October twenty fourteen. From the very beginning, the project was open source. Beauchemin did not build it as a closed Airbnb tool and donate it later. He built it in public.

The central idea, the one he carried over from Data Swarm, was the directed acyclic graph. You describe your pipeline as a graph of tasks. Task A produces something. Task B reads what A produced. Task C reads what B produced. The arrows have a direction, from upstream to downstream, and there are no loops. You cannot have A depending on B depending on A, because that would never terminate. This is just graph theory. It is one of the oldest objects in computer science. What Beauchemin did was make it the unit of work for data pipelines, and that turned out to matter enormously.
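
The idea fits in a few lines. This is a minimal sketch using Python's standard graphlib module rather than Airflow itself, and the task names are hypothetical. It shows the two properties that matter: a valid graph has an order you can run it in, and a graph with a loop is rejected before anything runs.

```python
from graphlib import TopologicalSorter, CycleError

# Each key is a task; each value is the set of tasks it depends on.
# The task names are hypothetical, chosen only for illustration.
pipeline = {
    "extract_bookings": set(),
    "join_bookings": {"extract_bookings"},
    "build_dashboard": {"join_bookings"},
}

def run_order(graph):
    """Return one valid execution order, upstream tasks first."""
    return list(TopologicalSorter(graph).static_order())

def has_cycle(graph):
    """Return True if the graph contains a loop and so is not a valid DAG."""
    try:
        run_order(graph)
    except CycleError:
        return True
    return False

order = run_order(pipeline)

# A depends on B, and B depends on A: there is no valid order for this one.
looped = {"a": {"b"}, "b": {"a"}}
```

Here order comes out as extract, then join, then dashboard, and has_cycle(looped) is True.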

The second idea, which mattered just as much, was that you wrote the graph in Python. You did not configure your pipeline in a custom domain language. You did not fill in a graphical form. You wrote a Python script, and that Python script was your pipeline. If you needed branching logic, you used a Python if statement. If you needed a loop, you wrote a for loop. Pipelines as code, treated like software, versioned in git, reviewed in pull requests, tested. Most existing schedulers at the time treated pipelines as configuration. Beauchemin's bet was that data engineers were software engineers, and software engineers wanted to write software.
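
To make that concrete, here is a small sketch of a pipeline built by ordinary Python control flow. It is not Airflow's API. The graph is a plain dictionary, and the region list, the audit flag, and the task names are assumptions invented for the example.

```python
# Hypothetical inputs, used only for illustration.
REGIONS = ["us", "eu", "apac"]
INCLUDE_AUDIT = True

def build_pipeline(regions, include_audit):
    """Build a task graph as a dict of task name -> set of upstream task names."""
    graph = {"combine": set()}
    for region in regions:              # an ordinary loop creates one task per region
        task = f"extract_{region}"
        graph[task] = set()
        graph["combine"].add(task)      # combine waits for every extract
    if include_audit:                   # an ordinary if statement adds a branch
        graph["audit"] = {"combine"}
    return graph

pipeline = build_pipeline(REGIONS, INCLUDE_AUDIT)
small = build_pipeline(["us"], False)
```

Adding a region is a one-line change to a list, and it shows up in a pull request like any other code change.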

The third idea was harder to articulate, but turned out to seal the deal. Airflow shipped with a web interface from day one. You could open a browser and see your graph. You could see which tasks had finished, which had failed, which were still running. You could click on a failed task and read its logs without leaving the browser. For data engineers who had spent years grepping through cron output by hand, this was a different planet.

The Switzerland of Software

In the summer of twenty fifteen, Airbnb moved Airflow into its corporate GitHub account and announced the project publicly. In March twenty sixteen, it joined the Apache Software Foundation's incubation program. From that point on, the project was called Apache Airflow. In early twenty nineteen it graduated to top-level Apache status.

Donating an open source project to Apache is a particular kind of move. The Apache Software Foundation is a nonprofit that holds the intellectual property of certain open source projects in a neutral home. Once a project is under Apache, no single company can yank it back, change the license, or take the project commercial. Beauchemin has described Apache as the Switzerland of software. It is neutral ground.

For Airbnb, this was a strategic gift to the rest of the industry. For everyone else, it was a promise. If you bet your data infrastructure on Airflow, you would not wake up one day to find that some venture-backed startup had bought the project and changed the license. That promise mattered. Within a few years, Airflow was running production pipelines at Netflix, Spotify, Stripe, Lyft, and several thousand smaller companies. A whole new job title appeared in job listings — Airflow engineer — and a whole new conference appeared too, the Airflow Summit, which is still held every year.

The Long Decade

Once Airflow won, it had to deal with what happens when you win. Every data team in the world started using it, and every data team in the world found something to complain about.

The complaints were specific and they were earned. The scheduler was a single point of failure. The metadata database under the scheduler hit performance walls on big deployments. The user interface was rendered with an older Python templating engine and felt sluggish even on a small graph. Writing a new operator, which is what Airflow calls a reusable task type, meant digging into the framework's internals. Dynamic pipelines, where the shape of the graph depended on data that did not exist until the pipeline started running, were technically possible but felt like wrestling the framework into submission.

Airflow two point zero arrived in December twenty twenty. It was a serious release. The scheduler was rewritten to be highly available, so a single crash no longer brought everything down. A new authoring pattern called TaskFlow let you write pipelines as decorated Python functions, which felt much more natural than building task objects by hand. By any honest measure, two point zero was a strong improvement.
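
The decorated-function style can be imitated in miniature. The sketch below is a toy, not Airflow's TaskFlow implementation. The task decorator, the Node class, and the to_graph helper are hypothetical names written for this example. The point is only that when one decorated function is called with the result of another, the dependency between them can be inferred instead of declared.

```python
class Node:
    """Stands in for a task's future output and remembers what feeds it."""
    def __init__(self, name, upstream):
        self.name = name
        self.upstream = upstream  # list of Node objects this task reads from

def task(func):
    """Hypothetical decorator: calling the function records a node; nothing runs yet."""
    def wrapper(*args):
        return Node(func.__name__, [a for a in args if isinstance(a, Node)])
    return wrapper

@task
def extract():
    return [1, 2, 3]

@task
def transform(rows):
    return [r * 2 for r in rows]

@task
def load(rows):
    return len(rows)

def to_graph(node, graph=None):
    """Walk back from the final node and collect task -> set of upstream names."""
    graph = {} if graph is None else graph
    graph[node.name] = {u.name for u in node.upstream}
    for u in node.upstream:
        to_graph(u, graph)
    return graph

# Writing the pipeline looks like ordinary function composition.
graph = to_graph(load(transform(extract())))
```

In this toy the function bodies never execute. Only the shape of the calls is recorded, and a scheduler would run the bodies later, in dependency order.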

But in the years between two point zero in twenty twenty and three point zero in twenty twenty five, two newer projects gained ground by arguing that Airflow had made the wrong bets at the foundational level. Those projects were called Prefect and Dagster, and they are a whole separate story, which we will get to next time. For now, the thing to know is that by twenty twenty three the data engineering community was openly debating whether Airflow was still the right default for a new project. The answer most senior engineers gave was a tired yes: yes, because of the integrations, the documentation, the community, the sheer mileage. But the yes had a tone of resignation to it.

The Reckoning

In April twenty twenty five, Apache Airflow three point zero shipped. It was the first major version in more than four years and the most ambitious release in the project's history.

The architectural change at the heart of three point zero was something the team called the Task Execution interface. In older versions of Airflow, task code and the core of the system were tightly coupled. Workers ran tasks with direct access to Airflow's metadata database, so every worker had to live close to that database and be trusted with it. This made it hard to run tasks in different languages, in different cloud regions, or behind a customer's firewall. In three point zero, tasks talk to Airflow over a network application programming interface instead. The worker can be anywhere. It can run in another cloud. It can sit inside a customer's data center. It can be written in something other than Python.

The user interface was rewritten from the ground up using React and a new web server based on FastAPI. The team added a dark mode, which is a small thing that nonetheless made the internet very happy. They added directed acyclic graph versioning, which means every run of a graph is tied to a specific immutable snapshot of its definition. You can always go back six months later and know exactly what code executed on that historical run. For anyone in a regulated industry where auditors ask exactly that question, this was overdue by about a decade.
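
The versioning idea can be sketched with a content hash. This is an illustration of the concept only, not a description of how Airflow stores versions. The VersionStore class, the run identifiers, and the one-line pipeline definitions are all hypothetical.

```python
import hashlib

class VersionStore:
    """Keeps every distinct pipeline definition, keyed by a hash of its source text."""
    def __init__(self):
        self.snapshots = {}   # version id -> source text, never overwritten
        self.runs = {}        # run id -> version id

    def record_run(self, run_id, source):
        """Snapshot the definition used by this run and return its version id."""
        version = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
        self.snapshots.setdefault(version, source)
        self.runs[run_id] = version
        return version

    def source_for(self, run_id):
        """Answer the auditor's question: what definition did this run execute?"""
        return self.snapshots[self.runs[run_id]]

store = VersionStore()
v1 = store.record_run("run_old", "extract >> load")
v2 = store.record_run("run_new", "extract >> clean >> load")
v3 = store.record_run("run_repeat", "extract >> load")
```

Editing the pipeline produces a new version id, and the old run still points at the old snapshot.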

The biggest conceptual addition was asset-driven scheduling. Instead of saying run this graph every six hours, you can say run this graph whenever the upstream asset changes. A new file landing in a cloud storage bucket can trigger a graph. A new row in a database table can trigger a graph. Airflow can now react to data events, not just to clock ticks. This was the project's answer to the heretics — the new projects that had been arguing for years that thinking in clock ticks was obsolete.
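
Stripped to its core, the difference between the two models looks something like this. The sketch is a toy event dispatcher, not Airflow's asset API. The AssetScheduler class, the pipeline name, and the bucket paths are invented for illustration.

```python
from collections import defaultdict

class AssetScheduler:
    """Toy event-driven scheduler: pipelines subscribe to assets and run when they change."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # asset name -> pipeline names
        self.triggered = []                   # log of (pipeline, asset) pairs

    def subscribe(self, pipeline, asset):
        """Declare that a pipeline should run whenever the asset is updated."""
        self.subscribers[asset].append(pipeline)

    def asset_updated(self, asset):
        """Called when new data lands; triggers every pipeline subscribed to it."""
        for pipeline in self.subscribers.get(asset, []):
            self.triggered.append((pipeline, asset))

sched = AssetScheduler()
sched.subscribe("daily_revenue", "s3://example-bucket/bookings.parquet")

# Hypothetical events: one the pipeline cares about, one it does not.
sched.asset_updated("s3://example-bucket/bookings.parquet")
sched.asset_updated("s3://example-bucket/unrelated.csv")
```

There is no clock anywhere in that code. The pipeline runs once, when the bookings file changes, and ignores the unrelated file.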

A year later, in April twenty twenty six, three point two added multi-team support. A single Airflow deployment can now host multiple isolated teams, each with their own pipelines, connections, and permissions, without stepping on each other. The platform team manages one Airflow. The data engineering team and the machine learning team and the analytics team all share it without interfering. This pattern matters because the largest organizations had been running three or four separate Airflow deployments just to keep teams isolated. Now they can collapse those into one, which is exactly the kind of consolidation that happens in a mature platform.

What Now

Today, Apache Airflow has around thirty-eight thousand stars on GitHub. It is the default workflow orchestrator for a generation of data teams. The eleven-year-old directed acyclic graph abstraction has aged remarkably well. Pipelines really are graphs. Failures really do propagate downstream. Pipelines really do need to be versioned and tested like software.

Maxime Beauchemin did not stay at Airbnb. He went to Lyft for a while, then founded a company called Preset around Apache Superset, the open source dashboard tool he had also created at Airbnb. He is no longer the day-to-day maintainer of Airflow, but the project he started in October twenty fourteen now runs critical pipelines at thousands of companies. Banks settle trades on it. Streaming services compute royalties on it. Pharmaceutical companies run clinical trial analyses on it. The thing he built because cron was making him miserable now runs the back office of the modern internet.

The story is not over. The thing that makes Airflow interesting in twenty twenty six is that it is no longer alone. Three serious challengers have spent the last several years arguing that the directed acyclic graph itself is the wrong abstraction. Two of them argue for a different kind of graph. One of them argues for a different concept of execution entirely. One of those three was started by a former Airflow contributor who concluded that the project he had been helping to maintain could not be fixed from the inside. That is where we go next time.