This is episode thirty-nine of Git Good.
In June of two thousand twenty-one, Nat Friedman, the CEO of GitHub, posted a message that sounded like a dispatch from the future.
We spent the last year working closely with OpenAI to build GitHub Copilot. We have been using it internally for months, and cannot wait for you to try it out. It is like a piece of the future teleported back to two thousand twenty-one.
The product worked like this. You opened your code editor, started typing a function, and Copilot finished it for you. Not a single line. Entire blocks of code. It predicted what you were trying to build and wrote it before you could. The demo was mesmerizing. Developers who tried the technical preview came back stunned. It felt like magic. It felt like the future had indeed arrived early.
One year later, in June of two thousand twenty-two, Copilot launched as a paid product. Ten dollars a month. A hundred dollars a year. Available to anyone with a credit card. And almost immediately, the magic started to curdle. Because the developers who were amazed by what Copilot could do started asking a different question. Not how does it work, but what did it learn from?
The answer was simple and staggering. GitHub Copilot was trained on every public repository on GitHub. Millions of them. Code written by hundreds of thousands of developers, published under open source licenses, uploaded to a platform that promised to host their work. That code, all of it, had been fed into a machine learning model built by OpenAI, and the model had been turned into a commercial product sold by Microsoft, which owns both GitHub and a large stake in OpenAI. The developers whose code powered the model were not asked. They were not paid. They were not even told until the product was already built.
The platform that hosted the world's open source code had used that code to train a product that competed with the people who wrote it.
To understand why this became a crisis and not just a complaint, you need to understand something about open source licenses. They were written for a world where software was copied, modified, and distributed by humans. The GNU General Public License, written by Richard Stallman in nineteen eighty-nine, says that if you take GPL-licensed code and build something with it, your creation must also be released under the GPL. This is copyleft. The idea is that freedom is contagious. You benefit from the community's work, so your work benefits the community in return.
The MIT license is simpler. Do whatever you want with the code, just keep the copyright notice attached. The Apache license adds patent protections. The BSD license is similar to MIT with minor variations. All of them share one assumption. A human being decides to use the code. A human being copies it into their project. And the license terms, whatever they are, travel with the code.
Copilot broke that assumption. The code was not copied into a project by a human. It was ingested by a neural network during training, transformed into statistical patterns across billions of parameters, and then reconstructed, sometimes verbatim, when a developer typed a prompt that matched those patterns. The question that nobody had anticipated was whether this process, training, transforming, reconstructing, constituted "use" under the license. And if it did, which license terms applied to the output.
GitHub's position was clear. Training a machine learning model on publicly available code is fair use. The output belongs to the person who prompted it. The licenses attached to the training data do not transfer to the generated code. GitHub offered no legal argument for this position. They simply asserted it.
The people who had spent decades building and defending open source licenses did not agree.
Bradley Kuhn had been enforcing the GPL for longer than most programmers had been writing code. As the policy fellow at the Software Freedom Conservancy, he had spent years in the trenches of license compliance, negotiating with companies that violated copyleft terms, sometimes quietly, sometimes through the courts. He understood the GPL not as an abstract legal document but as a political instrument. It existed because the alternative, waiting for legislatures to protect software freedom, had never worked and probably never would.
When Copilot launched, Kuhn did the math. GitHub admitted that during training, the model had encountered a copy of the GPL more than seven hundred thousand times. That was not a rounding error. That was a corpus. And the model, having absorbed all that GPL-licensed code, was now producing output that users could paste into proprietary projects without any copyleft obligation at all.
Users almost surely must construct their own fair use or non-copyrightability defenses for Copilot's output.
The argument was precise. If Copilot reproduced GPL-licensed code, and it did, sometimes word for word, then the output was potentially a derivative work under the GPL. That meant the project receiving the output would need to be GPL-licensed too. But Copilot did not tell you where the code came from. It did not attach a license. It did not warn you that the function it just suggested was lifted from a copyleft project. It simply gave you the code, and you, the developer, had no way of knowing whether pasting it into your proprietary application had just created a license violation.
GitHub's own analysis admitted that Copilot reproduced verbatim copies of its training data roughly zero point one percent of the time. That sounds small. But when millions of developers are generating millions of suggestions per day, zero point one percent is an enormous number of potential violations. And the examples that researchers found were not obscure edge cases. They were recognizable functions from well-known GPL projects, reproduced character for character, with no attribution and no license attached.
The Conservancy did not file a lawsuit. Instead, in February of two thousand twenty-two, they did something more methodical. They announced a Committee on AI-Assisted Programming and Copyleft, a ten-person group of legal scholars, license authors, and software freedom advocates tasked with developing a community response. The committee included Allison Randal, Karen Sandler, Stefano Zacchiroli, and Kuhn himself. Their mandate was to figure out whether copyleft could survive the age of machine learning, and if so, how.
We must assume a copyleft maximalist approach, until courts or the legislature disarm all mechanisms to control users' rights.
The phrase "copyleft maximalist" was deliberate. It meant treating Copilot's output as potentially encumbered by the licenses of its training data until someone proved otherwise. The burden of proof, Kuhn argued, should be on the company making billions from the code, not on the volunteers who wrote it.
Drew DeVault chose a different word. Not infringement. Not violation. Laundering.
DeVault, the creator of SourceHut, a Git forge built on the principle that open source infrastructure should not depend on corporate platforms, published a blog post in June of two thousand twenty-two titled "GitHub Copilot and open source laundering." The metaphor was blunt and it stuck.
They built a tool which facilitates the large-scale laundering of free software into non-free software.
The argument went like this. A developer writes a function under the GPL. They publish it on GitHub. Copilot ingests it during training. Later, a different developer working on a proprietary project types a prompt, and Copilot suggests code that is functionally identical to the GPL-licensed original. The developer accepts the suggestion. The GPL-licensed code is now inside a proprietary project, with no license, no attribution, and no awareness on anyone's part that a violation just occurred. The neural network acted as the laundering mechanism. Copyleft code went in one end. Code with no obligations came out the other.
The only thing which is necessary to legally circumvent a free software license is to teach a machine learning algorithm to regurgitate a function.
DeVault's post raised a question that the legal system was not equipped to answer. Is a trained model a derivative work? If you feed a million GPL-licensed functions into a neural network, is the resulting model itself bound by the GPL? DeVault argued yes. The model exists as the result of applying an algorithm to those inputs, and thus the model itself is a derivative work. GitHub argued no. The model is a new creation, a statistical abstraction, not a copy. The training data shaped the model's behavior but is not contained within it in any meaningful legal sense.
Both arguments had logic behind them. Neither had case law behind it. Nobody had ever asked a court to rule on whether training a neural network on copyleft code creates a copyleft obligation. The law was a generation behind the technology.
On November third, two thousand twenty-two, attorney Matthew Butterick and the Joseph Saveri Law Firm filed the first class-action lawsuit over AI code generation. Doe v. GitHub, Inc. The plaintiffs were anonymous, identified only as Doe, representing a proposed class of potentially millions of GitHub users whose code had been used to train Copilot without their consent.
The complaint named three defendants. GitHub, for building and selling Copilot. Microsoft, for owning GitHub and funding the effort. And OpenAI, for building the Codex model that powered it. The claims were ambitious. Twenty-two of them. Violation of open source licenses. Breach of GitHub's own terms of service. Violation of the Digital Millennium Copyright Act for stripping copyright management information from the code Copilot reproduced. Unjust enrichment. Unfair competition. The lawsuit was not just about Copilot. It was a test case for the entire question of whether AI training on copyrighted material is legal.
Microsoft and GitHub moved to dismiss. Their argument was essentially that the plaintiffs could not show harm because they could not point to specific code that Copilot had stolen from them specifically. The model does not memorize code, the defense argued. It learns patterns. And patterns are not copyrightable.
The court partially agreed. Over the next two years, Judge Jon Tigar narrowed the case dramatically. In May of two thousand twenty-three, he let two major claims survive, the open source license violation and the DMCA claim about stripped copyright notices. But in June of two thousand twenty-four, even the DMCA claim fell. Tigar ruled that the code Copilot produced was not identical enough to the plaintiffs' code for the DMCA's protections to apply. Of the original twenty-two claims, two survived. An open source license violation and a breach of contract.
The plaintiffs appealed. In April of two thousand twenty-five, they filed their opening brief in the Ninth Circuit, arguing that the DMCA does not require identicality, that stripping a copyright notice is a violation even if the surrounding code has been slightly altered. The Ninth Circuit accepted the appeal. The district court proceedings are stayed, frozen in place, until the appeals court rules.
We want a version of Copilot that is friendlier to open source developers. Where participation is voluntary, or where coders are paid to contribute to the training corpus.
The case is still open. The Ninth Circuit's ruling on the DMCA identicality question will shape not just this lawsuit but every lawsuit about AI training on copyrighted material for years to come. And while the courts deliberate, GitHub keeps training. The code keeps flowing into the model. And the model keeps producing output that nobody can trace back to its origins.
While the lawsuit crawled through the courts, GitHub made a gesture toward consent. They added an opt-out toggle. Developers could go into their account settings and check a box that said their code should not be used for Copilot training. The gesture was hollow in at least three ways.
First, the default was participation. Your code was being used for training unless you actively chose otherwise. GitHub knew, as every platform company knows, that most users never change default settings. That default was not an accident. It was a business decision.
Second, opting out did not undo anything. If your code had already been used to train the model, checking the box did not remove it from the model's weights. You cannot untrain a neural network on specific inputs. The code was already baked in. The toggle only affected future training runs, and even that was a promise, not a technical guarantee.
Third, and most insidiously, the opt-out applied to your repositories, not to your code. If you wrote a function and published it under the GPL, and someone else included that function in their project, and that project was on GitHub with training enabled, your code got trained on anyway. The opt-out was at the repository level, not the code level. One developer in an organization who forgot to check the box could expose everyone's work.
In March of two thousand twenty-six, GitHub pushed the boundary further. They announced that starting April twenty-fourth, Copilot would use interaction data from Free, Pro, and Pro Plus users to train its AI models. Not just the code in public repositories. The code you typed while using Copilot. The suggestions you accepted. The modifications you made. The context surrounding your edits. Code from private repositories, if you were using Copilot while working in them. All of it, feeding the model, opted in by default.
The developer reaction was immediate and overwhelmingly negative. On GitHub's own discussion board, the announcement received fifty-nine thumbs-down votes and three positive reactions. Developers called it a dark pattern. They pointed out that individual users within a company typically do not have the authority to license their employer's source code to a third party, yet the opt-out was set at the user level, not the organization level. A single team member who did not change the default could expose proprietary code.
This was done to us, not for us, regardless of how they frame it.
GitHub's chief product officer offered the company's reasoning. Participation would help the models better understand development workflows and deliver more accurate suggestions. The message was clear. Your code makes the product better. The product makes money. You get a slightly improved autocomplete.
The legal arguments about licenses and fair use and derivative works are important. They will be settled, eventually, by courts. But underneath the legal question is a philosophical one that no court can answer, and it is the question that makes this episode different from a contract dispute.
Open source worked because of a social contract. You write code. You publish it freely. Others use it, improve it, and publish their improvements. The license is the legal mechanism, but the social contract is the engine. Developers contributed to open source because they believed in the exchange. My work for your work. My library for your library. A rising tide of shared code that benefits everyone who participates.
Copilot broke the social contract without breaking the law, or at least without clearly breaking it. The code was public. The licenses, at least the permissive ones like MIT and BSD, arguably allowed any use, including training a neural network. The copyleft licenses were murkier, but the legal question was genuinely unsettled. Microsoft could plausibly argue that it had done nothing illegal. And yet something had clearly changed.
What changed was the power relationship. Before Copilot, open source was a gift economy where the gifts flowed in all directions. After Copilot, it was an extraction economy where the gifts flowed up. Developers wrote code and gave it away. GitHub collected it and trained a model. Microsoft sold the model back to the developers who wrote the code it was trained on. Ten dollars a month for access to the distilled output of millions of hours of volunteer labor. The developers who refused to pay still had their code inside the model. They just could not benefit from it.
The chilling effect was real but hard to measure. Some developers stopped contributing to public repositories. Some switched their licenses from permissive to copyleft, hoping the GPL would offer protection that the MIT license did not. Some moved their projects off GitHub entirely, to GitLab, to SourceHut, to Codeberg, platforms that did not train AI on hosted code. But most developers, the vast majority, did nothing. They kept coding, kept pushing to GitHub, kept contributing to the corpus that trained the model that competed with them. Because what else were they going to do? GitHub is where the developers are. The network effects are enormous. Leaving GitHub means leaving the largest collaboration platform in the history of software. The switching costs, the ones we talked about in episode thirty-eight, are the same ones that keep developers feeding the machine that feeds on them.
And this is what makes the Copilot problem a Git problem, not just an AI problem. GitHub did not train on a random collection of code scraped from the internet. It trained on Git repositories. Repositories with full histories. Repositories with commit messages that explain why each change was made. Repositories with blame annotations that link every line to the person who wrote it. Git's meticulous record-keeping, the feature that made collaboration possible, also made it the most perfectly structured training dataset in the history of machine learning. The very thing that made open source visible made it consumable.
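That per-line record-keeping is not an abstraction; you can see it with git blame. A minimal sketch in a throwaway repository (the file name, author, and commit message here are hypothetical, invented for illustration):

```shell
# Throwaway repo: one file, one author, to show per-line attribution.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'def helper():\n    return 42\n' > util.py
git add util.py
git -c user.name="Alice Example" -c user.email="alice@example.com" \
    commit -q -m "add helper"

# --line-porcelain emits an "author" record for every single line of the file.
git blame --line-porcelain util.py | grep '^author '
```

Every line of `util.py` comes back stamped with the name of the person who committed it. That stamp is exactly what survives in a Git repository and exactly what does not survive the training process.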
In Season One, we traced how GitHub re-centralized what Git distributed. The tool that was supposed to make code free instead made it convenient, and convenience concentrated it on one platform. Now that platform has taken the next step. It did not just host the code. It learned from it. And it is selling what it learned.
The Ninth Circuit will rule. The ruling will either establish that stripping a copyright notice from AI-reproduced code is a DMCA violation even without perfect identicality, or it will not. If it does, every AI training pipeline that uses copyrighted material will need to reckon with attribution requirements. If it does not, the legal framework for protecting open source from AI training will be, for practical purposes, gone.
But the legal outcome matters less than you might think. Even if the plaintiffs win everything, even if the court rules that Copilot violates open source licenses and the DMCA and contract law, the model has already been trained. The code is already inside it. You cannot make a neural network unlearn. The most a court can do is impose damages and require changes to future training practices. The past is the past. The weights are set.
The deeper question is whether the social contract of open source can survive the age of AI. Whether developers will keep writing code and giving it away when they know that giving it away means feeding a machine that profits from their work without reciprocating. Whether the next generation of developers, the ones who grew up with Copilot, will even think of open source as a gift economy, or whether they will see it as a free resource to be mined.
The Redis story from episode thirty-seven asked who gets paid when open source creates enormous value. The Copilot problem asks the same question, but at a scale Redis never imagined. Redis was one company changing one license for one project. Copilot is one company extracting value from every open source project simultaneously. The fork button cannot solve this one. You cannot fork your way out of a trained model.
There is an irony that sits at the center of this story, and it is worth naming before we go. Git was built to track authorship. Every commit carries a name. Every line of code can be traced back to the person who wrote it. Linus Torvalds designed the system so that contributions could never be anonymous, so that credit and responsibility would be permanently attached to the work. And now the largest platform built on top of Git has found a way to strip that authorship completely. The code goes in with a name attached. It comes out with no name at all. The model does not remember who wrote what. It just produces code. The careful history that Git preserves, the commit messages, the blame log, the chain of authorship stretching back to the first line, all of it is dissolved in the training process.
Git was a system for remembering. The model is a system for forgetting.
That was episode thirty-nine of Git Good. Next time, we will meet the people who built the code that the world depends on, and who the world forgot in return.
Git log with the author flag filters the commit history by who wrote it. Type git log followed by two dashes and then author equals and a name, and Git shows only the commits from that person. It works with partial matches, so you do not need the full name. The interesting thing about this command in the context of AI-generated code is what it cannot show you. When a developer accepts a Copilot suggestion and commits it, the commit author is the developer, not the model. There is no flag for git log that shows you which lines were written by a human and which were suggested by a machine. The attribution system is honest about who pressed the commit button. It is silent about who wrote the code.
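For listeners following along at a keyboard, the command from this segment can be sketched in a throwaway repository (the author names and commit messages here are hypothetical, invented for illustration):

```shell
# Throwaway repo with commits from two different authors.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name="Alice Example" -c user.email="alice@example.com" \
    commit -q --allow-empty -m "add parser"
git -c user.name="Bob Example" -c user.email="bob@example.com" \
    commit -q --allow-empty -m "fix parser edge case"

# Filter history by author. Partial matches work, so "Ali" is enough
# to find Alice's commits and exclude Bob's.
git log --author="Ali" --oneline
```

The filter shows only Alice's commit. And as the segment notes, if Alice had accepted a Copilot suggestion before committing, the log would look exactly the same: the attribution system records who pressed the commit button, not who wrote the code.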