Git-Scraping: The Daily Diff as a Story

A Tree Falls in San Francisco

Sometime in late twenty-twenty, a man named Simon Willison noticed that the city of San Francisco maintains a public database of every street tree on every public block. The database is online, downloadable, updated by someone at City Hall whenever a tree is planted, removed, replaced, or reclassified. Willison is the kind of person who notices databases the way some people notice good restaurants. He decided to start tracking it.

What he wanted was not a copy of the database. He wanted to know when it changed. He wrote a small script that downloaded the file once a day and committed it to a repository on his GitHub account. He scheduled the script to run automatically. Then he left it alone.

Years later, the script is still running. Most workdays, somebody at San Francisco City Hall edits the tree database. Willison does not know who. He has tried to figure out which department is responsible and has so far failed. But every day, when the script runs, it captures whatever the unknown city employee changed. The trees that were removed. The trees that were added. The species reclassified from one taxonomy to another. Five years of these changes are now sitting in a git repository, every change time-stamped, every change attributable to a specific commit, every earlier version recoverable if you want to see what the database looked like on a Tuesday in twenty-twenty-two.

This is git-scraping. It is one of the strangest and most useful patterns to come out of journalism technology in the last decade. It has no commercial vendor. It has no proprietary tool. It is a method, named by Willison, popularized through his blog, and now used by hundreds of journalists, researchers, and obsessives around the world.

Why The Diff Is The Artifact

The thing that takes a moment to absorb about git-scraping is that the database you build is not the point. The data sitting in the repository today is interesting only as a snapshot. What is genuinely valuable is the history. The five hundred commits that show, day by day, what changed.

Imagine the Swedish Mining Authority publishes a daily list of valid exploration permits. The list is online. Anyone can download it. On Monday it has eight hundred and forty-seven entries. On Tuesday it has eight hundred and forty-nine. Two new permits were granted. If you only had Monday's list, and then Tuesday's list, you could compare them and notice the change. But you would have to be paying attention on the right day. If you missed Tuesday and downloaded Wednesday's list instead, you would still see the new permits, but you would not know exactly when they were added. You would not know if any permits were revoked during the week and re-added.

Git-scraping captures each day's changes automatically, without you paying attention. The script runs at midnight. It downloads the list. If the list has not changed, nothing happens. If the list has changed, git commits the new version, and the commit timestamp becomes a record of when. Over time, the repository becomes a day-by-day history of how the database evolved. Every change the scraper caught, in every field of every record, is preserved. You can browse it the way you browse any git repository, jumping back to any point in time and looking at exactly what the data showed.
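
The loop described above can be sketched in a few lines of Python. This is a minimal sketch, not Willison's actual script. It assumes git is installed, that the working directory is already a git repository, and that the dataset is a single downloadable file. The URL and the file name in the closing comment are hypothetical.

```python
import subprocess
from pathlib import Path
from urllib.request import urlopen


def save_if_changed(path: Path, new_bytes: bytes) -> bool:
    """Write new_bytes to path only if the content differs. Return True if written."""
    if path.exists() and path.read_bytes() == new_bytes:
        return False
    path.write_bytes(new_bytes)
    return True


def scrape_and_commit(url: str, repo_dir: Path, filename: str) -> bool:
    """Download url into repo_dir/filename and commit it if the content changed."""
    with urlopen(url, timeout=60) as response:
        new_bytes = response.read()
    if not save_if_changed(repo_dir / filename, new_bytes):
        return False  # unchanged: no commit, so the history stays quiet
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"Update {filename}"], cwd=repo_dir, check=True
    )
    return True


# Hypothetical usage, run once a day by a scheduler:
# scrape_and_commit("https://example.org/permits.json", Path("."), "permits.json")
```

The design choice worth noticing is the early return. A day with no change leaves no commit, so every commit in the log means something moved.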

[calm]

This is profoundly different from how government databases are usually consumed. Most people who use the Mining Authority's permit list look at it once, for a current question, and forget it. The list changes silently behind their back. The journalism is in the snapshot. Git-scraping turns the journalism into the history. The story is no longer what the database shows today. The story is what changed.

Why The Pattern Works

The reason git-scraping became a pattern, and not just a clever trick Willison did once, is that git is unreasonably well-suited to the job. Git was built to track changes in source code. Source code changes one line at a time, in human-readable text, with the changes meaningful to a reader. A public dataset, if you save it in the right format, changes the same way. A row gets added. A field gets edited. A record disappears.

If you store the dataset as a JSON file, formatted with one record per line, then git's normal diff tools work on it directly. You can run git log and see every change. You can run git diff between any two commits and see exactly what was added, removed, or modified. You can run git blame on any specific line and see which commit last changed it. All of this is free, because git already does it for source code, and the data is just being treated as a strange form of source code.
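
Getting the data into that shape is the one step worth doing carefully. Here is a minimal sketch, assuming the dataset is a list of records and that each record carries a unique identifier. The function name and the key field are illustrative, not taken from any particular scraper.

```python
import json


def to_diffable_json(records: list[dict], key: str) -> str:
    """Serialize records as a JSON array with exactly one record per line.

    Records are sorted by the key field (compared as text) and each record's
    own fields are sorted too, so a change in the order the source returns
    things does not show up as a change in the diff.
    """
    ordered = sorted(records, key=lambda record: str(record[key]))
    lines = [
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in ordered
    ]
    return "[\n" + ",\n".join(lines) + "\n]\n"
```

Without the sorting, a source that returns the same records in a different order every day would produce a large diff every day, and the real changes would be buried in it.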

The other piece that makes git-scraping practical is GitHub Actions. GitHub provides free continuous integration minutes for public repositories. You can write a workflow file that says, run this script every day at midnight, and GitHub will run it for you on their servers, forever, at no cost. You do not need a server. You do not need a database. You do not need to maintain infrastructure. Your scraper lives in the same repository as your data, runs on someone else's machine, and writes its output back to itself.

For a one-person newsroom, this combination is almost too good to be true. You can run dozens of git scrapers, each tracking a different public dataset, for the cost of zero dollars per month. The only thing you pay is attention, and only when something interesting happens.

The Editorial Sensor Pattern

Here is the part of git-scraping that turns it from a clever trick into a journalism methodology. The git repository is not the product. The product is what you do when the repository changes in an interesting way.

The basic move is this. You write the scraper. The scraper runs daily. Every commit is a change. You subscribe to the commit feed using an RSS reader, or you set up an email notification, or you write a small script that monitors the repo and pings you when commits happen. Every time a commit lands, you take a quick look. Most commits are boring. A small change to one field. A formatting tweak. Nothing worth a story.
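
The small monitoring script can be very small indeed. This sketch assumes you keep a local clone of the scraper repository, that something else has already pulled the latest commits into it, and that you have stored the hash of the last commit you looked at. How you get pinged, by email or chat message or anything else, is left out on purpose.

```python
import subprocess
from pathlib import Path

# One commit per line: full hash, ISO 8601 commit date, subject, tab-separated.
LOG_FORMAT = "%H%x09%cI%x09%s"


def parse_log(output: str) -> list[dict]:
    """Turn the tab-separated git log output into a list of small dicts."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, when, subject = line.split("\t", 2)
        commits.append({"sha": sha, "when": when, "subject": subject})
    return commits


def commits_since(repo_dir: Path, last_seen_sha: str) -> list[dict]:
    """List commits made after last_seen_sha, oldest first."""
    result = subprocess.run(
        ["git", "log", "--reverse", f"--format={LOG_FORMAT}",
         f"{last_seen_sha}..HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return parse_log(result.stdout)
```

An empty list means a quiet day. Anything else is worth a quick look.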

But sometimes the commit is interesting. A new permit appears that you have been watching for. A field changes value in a way that suggests something happened. A record disappears that should not have. When this happens, the commit itself becomes a lead. You investigate. Maybe nothing comes of it. Maybe a short news brief in the paper. Maybe a larger story, weeks down the road, that started because the scraper noticed something at midnight when you were asleep.
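
Those three kinds of interesting commit, a record that appears, a field that changes, a record that disappears, can be pulled out of two snapshots mechanically. This is a sketch under one assumption: every record has a unique identifier that stays the same from day to day. The field names used in the test data for this idea, such as permit_id, are hypothetical.

```python
def diff_records(old: list[dict], new: list[dict], key: str) -> dict:
    """Compare two snapshots of a dataset, matching records on a unique key field."""
    old_by_key = {record[key]: record for record in old}
    new_by_key = {record[key]: record for record in new}

    added = [rec for k, rec in new_by_key.items() if k not in old_by_key]
    removed = [rec for k, rec in old_by_key.items() if k not in new_by_key]

    changed = []
    for k, before in old_by_key.items():
        after = new_by_key.get(k)
        if after is None or after == before:
            continue
        # Keep only the fields whose values differ, as (old, new) pairs.
        fields = {
            name: (before.get(name), after.get(name))
            for name in sorted(set(before) | set(after))
            if before.get(name) != after.get(name)
        }
        changed.append({"key": k, "fields": fields})

    return {"added": added, "removed": removed, "changed": changed}
```

Fed Monday's list and Tuesday's list, it returns the new permits, the vanished permits, and the exact fields that moved. What it cannot return is whether any of it matters.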

This is what investigative reporters in the United States have started calling the editorial sensor pattern. The scraper is a sensor. It monitors a public data source you care about. Its job is not to produce articles. Its job is to alert you when reality changes in a way that is worth your attention. The articles, when they happen, are downstream of the sensor's alerts. The sensor is internal infrastructure. The articles are the journalism.

A famous example of this in the United States is the Big Cases Bot, written by a reporter named Brad Heath. The bot watches federal court filings in cases involving important public figures. When a new filing appears, the bot grabs it, uploads it to DocumentCloud, and tweets out a notification. Brad Heath does not write the bot every day. He wrote it once. It runs forever. It generates leads. The journalism is downstream of the leads.

Why Narrowness Is The Strategy

It would be tempting, faced with the power of git-scraping, to scrape everything. The whole national permit database, the whole national company registry, every public filing from every regulator. This is a mistake, and a common one, and worth talking about.

The problem with scraping everything is that the alerts become useless. If a thousand things change in your scraper every day, you cannot read a thousand alerts. You stop reading them. The system collapses into noise. The sensor was supposed to filter the world down to what you care about. Instead it has filtered the world up to overwhelm.

[serious]

The strategy that works is narrowness. You scrape a small, specific slice of the world that you actually care about. Mineral permits in one county. Building permits for one neighborhood. Court filings in one specific kind of case. Corporate filings for a specific set of companies you are tracking. When changes happen in your narrow slice, they almost always matter, because you only chose to scrape things that matter.
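
In practice, narrowness is a filter applied before anything is saved. A minimal sketch, assuming the source publishes more than you want and that each record names its county in a field. The field name and the county values are assumptions for illustration.

```python
def keep_slice(records: list[dict], field: str, allowed: set[str]) -> list[dict]:
    """Keep only the records whose field value is in the allowed set.

    Matching ignores upper and lower case. Records missing the field are dropped.
    """
    wanted = {value.casefold() for value in allowed}
    return [
        record for record in records
        if str(record.get(field, "")).casefold() in wanted
    ]
```

Filter first, then commit. The repository then only ever changes when your slice changes, and every commit is about something you chose to care about.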

This is the moat for a small newsroom. A national paper can scrape everything because it has enough reporters to triage the firehose. A one-person paper cannot. But a one-person paper has a different advantage. The reporter knows the local context. The reporter knows that this permit holder is the brother-in-law of the previous owner. The reporter knows that the company in question has been on the planning board's agenda twice in three years. The narrow scraper plus the local knowledge produces leads that no national paper could find, because no national paper has the context to interpret the diff.

What Git-Scraping Does Not Replace

The thing git-scraping does not do is the actual reporting. The scraper notices that something changed. It does not know why. It does not know whether the change matters. It does not know who to call about it. It does not know what the previous owner of the permit said in two thousand eighteen. The scraper is a sensor. It generates leads. The leads are the beginning of journalism, not the end.

This is worth saying because there is a strong temptation, when you have built a sensor, to publish the sensor's output as if it were a story. To run a daily blog of changes. To tweet every diff. To turn the repository into a public dashboard. This is almost always a bad idea. The output of the sensor is interesting to you, the reporter, because you have context. It is boring to a reader who does not have your context. The dashboard becomes another piece of infrastructure that nobody looks at, occupying the same psychological space as the original undifferentiated database.

The journalism is in the article you write when the sensor catches something important. The article frames the change. The article tells the reader why this particular thing in this particular database matters. The article does the work of translating from raw data to public understanding. The scraper is upstream of all that. The scraper is essentially a clock radio that wakes you up when something happens. You do not publish your clock radio.

The Long Half-Life of A Good Sensor

The reason git-scraping deserves to be understood by anyone doing data-driven local journalism is that, once built, a sensor lasts forever with almost no maintenance. Willison's tree scraper has been running for years. It has needed almost no attention. The GitHub Actions runner keeps running. The script keeps fetching. The data keeps committing. Every day it does its job. Every now and then it produces something interesting.

For a one-person newsroom, this is the dream. You do not have the budget to hire a data team. You do not have the time to manually check every database every day. But you have the attention to write one scraper for one specific data source you care about, and then to glance at the commit feed over your morning coffee. The labor scales with the number of sensors you set up, but each individual sensor is almost free to maintain after the first day.

A reasonable goal, over the course of a working year, might be to have ten to twenty narrow sensors running on data sources relevant to your beat. Mineral permits, building permits, planning applications, corporate filings, court filings, government press releases. Each sensor is a small commitment. Most of the time, none of them produce anything interesting. But every few months, one of them catches something. And because you have ten to twenty of them, every few months becomes every few weeks. The cadence of leads becomes self-sustaining.
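
The arithmetic behind that cadence is simple enough to check. This is a rough sketch, assuming every sensor catches something at the same average rate and independently of the others. The figure of one catch per sensor every one hundred and eighty days is an assumption chosen for illustration, not a measurement.

```python
def days_between_leads(sensor_count: int, days_between_hits_per_sensor: float) -> float:
    """Average number of days between leads across all sensors combined."""
    if sensor_count <= 0:
        raise ValueError("sensor_count must be positive")
    return days_between_hits_per_sensor / sensor_count


# Assumed rate: each sensor catches something about once every 180 days.
ten_sensors = days_between_leads(10, 180)     # 18.0 days between leads
twenty_sensors = days_between_leads(20, 180)  # 9.0 days between leads
```

Under those assumptions, ten sensors give you a lead roughly every two and a half weeks, and twenty give you one a little more often than weekly.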

This is what data journalism looks like for a working reporter who is not at a national paper. It is not big interactive features and viral charts. It is a quiet wall of small sensors, running in the background, occasionally pinging you with something worth a phone call. The tree database in San Francisco has not produced a single article that Willison has written. But it might, someday. And the cost of waiting is essentially zero. That is the deal git-scraping offers. Patience plus automation, paid in pennies and curiosity, redeemable in stories you would not have found any other way.