Shot-Scraper: When the Web Became a Lie

The Day Curl Stopped Working

Sometime around two thousand fifteen, give or take a year, working journalists who scraped data from government websites started running into a strange problem. The scraping tools they had been using for years stopped working. The websites they were trying to scrape still loaded fine in a browser. The data was still there. The reporter could see it on the page. But when the reporter's script fetched the page, the script got back an almost empty document. The data was missing.

The reason for this was that the web had quietly changed. For the first decade or so of its existence, a web page was a self-contained document. The server sent the browser a chunk of HTML. The HTML contained all the text, all the structure, all the data. The browser rendered it. The page was complete the moment it loaded.

Over time, this stopped being true. Web developers learned that pages could be made faster, or more interactive, or more dynamic, by sending less data initially and then loading more data after the page started rendering. A small program written in JavaScript would run in the browser, would make additional requests to the server for the actual data, and would then construct the page out of that data in real time. The user did not notice. The page just appeared the way it always had. But the page was no longer a self-contained document. The page was a small application that fetched its own contents.

For scrapers, this was a disaster. The traditional scraping tools, like the command-line program called curl or the Python library called requests, fetched the page the way the browser initially did. They got the initial empty document. They did not run the JavaScript. They did not load the data. They got back nothing useful. The web had become invisible to them.
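The gap can be shown with nothing but the Python standard library. This is a minimal sketch, not a real scrape: both HTML strings are invented for illustration, one standing in for what a plain fetch returns and one for what the browser ends up with after the JavaScript has run.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts the <tr> elements in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html: str) -> int:
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical document as a plain HTTP fetch would see it: an empty shell.
INITIAL_HTML = """<html><body>
<div id="app"></div>
<script src="/static/app.js"></script>
</body></html>"""

# Hypothetical document after the page's JavaScript has filled in the data.
RENDERED_HTML = """<html><body>
<div id="app"><table>
<tr><th>Agency</th><th>Budget</th></tr>
<tr><td>Parks</td><td>1200</td></tr>
<tr><td>Transit</td><td>5400</td></tr>
</table></div>
<script src="/static/app.js"></script>
</body></html>"""

print(count_rows(INITIAL_HTML))   # 0
print(count_rows(RENDERED_HTML))  # 3
```

The parser is the same in both cases. The only thing that changed is whether the JavaScript had a chance to run before the HTML was captured.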

This is the problem shot-scraper was built to solve. It is a small tool by Simon Willison, the same person who popularized git-scraping. It addresses one specific and increasingly common technical problem with elegance and practicality, and it has become a quiet staple of modern data journalism workflows.

What Headless Browsers Actually Are

To understand shot-scraper, you need to understand a category of software called headless browsers. A normal browser, the kind you use every day, displays web pages to a human. It has a user interface: a window, an address bar, tabs, and bookmarks. A headless browser is exactly the same thing minus the user interface. It loads pages, runs the JavaScript on those pages, executes the same network requests, applies the same styling, but does not display any of it to a screen.

[calm]

The point of a headless browser is that it can be controlled by a program. A script can tell the headless browser to open a page, wait for the JavaScript to finish loading, click a button, scroll down, fill in a form, and then extract whatever data the script wants. From the perspective of the website being visited, the headless browser looks exactly like a real browser. The JavaScript runs. The data loads. The page assembles itself. Everything works the way it would for a real user.

The most widely used headless browser engine today is Chromium, which is the open-source core of Google Chrome. Mozilla provides a similar headless mode for Firefox. WebKit, the engine underneath Apple's Safari, can be driven the same way. There is a standardized way to control browsers from scripts, which is called the WebDriver protocol. There is also a newer library from Microsoft called Playwright, which is separate from WebDriver but does the same kind of job, and which supports all three engines: Chromium, Firefox, and WebKit.

Playwright is the underlying engine for shot-scraper. The shot-scraper tool is a small wrapper around Playwright that exposes the most common scraping operations as a simple command line interface. You type a single command. The headless browser launches in the background. The page loads. The JavaScript runs. The data is extracted. The browser quits. You have your output.

The Specific Trick

The core trick of shot-scraper is that it does not try to be a general-purpose scraping platform. It does one specific thing: given a web address and a description of what you want, it gives back the result as a screenshot, a PDF, a chunk of HTML, or a piece of extracted structured data.

The screenshot mode is the most basic and most useful. You give the tool an address. The tool launches a headless Chromium. The tool waits for the page to load, and it can be told to wait longer for pages whose JavaScript keeps working after the initial load. The tool takes a picture of the rendered page and saves it as a PNG or JPEG file. This is useful for the kind of work where you want to preserve what a website looked like on a specific day. If you are writing about a government press release, taking a screenshot of the release as it was actually published, including any errors that get quietly fixed later, is a valuable archival practice.
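A minimal sketch of that archival habit, assuming shot-scraper is installed and on the path. The helper names, the address, and the dated file naming scheme are all invented for illustration. The only shot-scraper features relied on are the basic form of the command, an address followed by an output file.

```python
import datetime
import subprocess


def screenshot_command(url: str, slug: str, day: datetime.date) -> list[str]:
    """Build the shot-scraper command for a dated screenshot.

    The slug and the date-stamped file name are a hypothetical convention,
    chosen so that each day's capture gets its own file.
    """
    output = f"{slug}-{day.isoformat()}.png"
    return ["shot-scraper", url, "-o", output]


def take_screenshot(url: str, slug: str) -> str:
    """Run the command for today's date and return the file name written."""
    command = screenshot_command(url, slug, datetime.date.today())
    subprocess.run(command, check=True)  # raises if shot-scraper fails
    return command[-1]


# Building the command does not launch a browser, so it is safe to inspect.
print(screenshot_command(
    "https://example.gov/press/release-1",  # hypothetical address
    "release-1",
    datetime.date(2026, 1, 15),
))
```

Keeping the date in the file name means a later capture never overwrites an earlier one, which is the whole point of an archive.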

The HTML mode is the next step. The tool fetches the page, waits for the JavaScript, and returns the final, fully-rendered HTML. This is what the page looked like at the moment of full load, after all the dynamic content has been inserted. You can then run normal HTML parsing tools on this output, the same way you would have scraped pages a decade ago. The difference is that the HTML now contains the dynamically-loaded data, because the JavaScript has had a chance to run.

The structured data mode works by running a small snippet of JavaScript inside the loaded page and returning whatever that snippet produces as JSON. In practice the snippet usually leans on a feature called CSS selectors. You give it a path through the page, like, find the third table inside the section with this identifier, and then extract the second column of every row. The tool runs that query against the rendered page and returns the matching data as JSON. This is the most useful mode for ongoing scraping, because you are pulling structured data directly rather than trying to parse it out of HTML after the fact.
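A minimal sketch of that kind of extraction, assuming shot-scraper's javascript command, which runs an expression inside the loaded page and prints the result as JSON. The helper names, the address, and the selector are invented for illustration, and the selector is only one plausible way to write "second column of the third table in a section".

```python
import json
import subprocess


def extraction_js(selector: str) -> str:
    """Return a JavaScript expression that collects the text of every match.

    json.dumps turns the selector into a safely quoted JavaScript string.
    """
    return (
        "Array.from(document.querySelectorAll(%s), el => el.innerText.trim())"
        % json.dumps(selector)
    )


def javascript_command(url: str, selector: str) -> list[str]:
    """Build the shot-scraper command that evaluates the expression."""
    return ["shot-scraper", "javascript", url, extraction_js(selector)]


def extract_column(url: str, selector: str) -> list:
    """Run the command and parse the JSON that shot-scraper prints."""
    result = subprocess.run(
        javascript_command(url, selector),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# Hypothetical selector: second cell of each row, third table in #results.
SELECTOR = "#results table:nth-of-type(3) td:nth-child(2)"
print(extraction_js(SELECTOR))
```

Calling extract_column with a real address would launch the headless browser. The sketch only prints the expression, so it can be read and adjusted before anything is fetched.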

The combination of these modes makes shot-scraper suitable for almost any scraping problem on the modern web. If you can see it in a browser, you can usually scrape it with shot-scraper. The barrier between human-visible content and machine-extractable content has largely been removed.

The HAR Mode And Why It Matters

There is one mode of shot-scraper that deserves its own segment, because it solves a problem that even most experienced scrapers have not realized is solvable. The mode is called HAR, which is short for HTTP Archive. A HAR file is a complete record of every network request a browser made while loading a page, including the request headers, the response headers, the response bodies, and the timing.

When you ask shot-scraper to record a HAR file for a page, it loads the page in the headless browser and saves the full record of the network traffic. This means you can see every individual data source the page loaded behind the scenes. Often the most useful data on a modern page is not in the final HTML at all. It is in one of the JSON responses that the JavaScript fetched and used to construct the page. By capturing the HAR, you get direct access to that JSON, in its original structured form, without any of the noise of the rendered page.

[serious]

For a journalist scraping a government mapping tool, this is enormously valuable. The map application loads a polygon dataset from a hidden internal endpoint. The polygon dataset is what you actually want. The map is just the visual representation of it. With HAR mode, you can capture the polygon dataset directly, in machine-readable form, without trying to reverse-engineer it from the rendered map. The work that used to require careful inspection of network traffic in browser developer tools can now be done in a single command.
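Pulling the JSON back out of a HAR file takes only a few lines, because a HAR file is itself JSON. This is a minimal sketch, assuming the response bodies were recorded in the capture. The field names follow the HAR format. The sample data, the addresses, and the function name are invented for illustration.

```python
import base64
import json


def json_responses(har: dict) -> list[tuple[str, object]]:
    """Return (url, parsed body) for every JSON response recorded in a HAR."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        if "json" not in content.get("mimeType", ""):
            continue  # skip HTML, images, scripts and so on
        text = content.get("text")
        if not text:
            continue  # the body was not recorded for this entry
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            found.append((entry["request"]["url"], json.loads(text)))
        except json.JSONDecodeError:
            continue  # labelled as JSON but not parseable
    return found


# Hypothetical capture of a map page: one HTML shell, two JSON data sources.
SAMPLE_HAR = {"log": {"entries": [
    {"request": {"url": "https://example.gov/map"},
     "response": {"content": {"mimeType": "text/html",
                              "text": "<div id='map'></div>"}}},
    {"request": {"url": "https://example.gov/api/districts.json"},
     "response": {"content": {"mimeType": "application/json",
                              "text": '{"type": "FeatureCollection", "features": []}'}}},
    {"request": {"url": "https://example.gov/api/meta"},
     "response": {"content": {"mimeType": "application/json",
                              "encoding": "base64",
                              "text": base64.b64encode(b'{"updated": "2026-01-15"}').decode("ascii")}}},
]}}

for url, body in json_responses(SAMPLE_HAR):
    print(url, body)
```

With a real capture, the dictionary would come from reading the saved file with json.load rather than from a literal in the script.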

The Compositional Power

The thing that makes shot-scraper especially useful is that it composes well with other tools. The output of shot-scraper is ordinary files, and ordinary files can be committed to a git repository, which means you can combine it with the git-scraping pattern from earlier. You can write a GitHub Actions workflow that runs shot-scraper every day against a page that loads its data via JavaScript, captures the HAR file with the underlying data, extracts the relevant JSON, and commits the result to a repository. The whole pipeline, from fetching to storing to versioning, is automated.

The combination is more powerful than either tool alone. Git-scraping by itself works only on data that a plain HTTP fetch can retrieve. Shot-scraper by itself does not have the change-detection mechanism. Together, they handle a much larger class of modern web data sources. A page that loads its data dynamically becomes, with a short workflow file, a tracked, versioned, queryable history.
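The change-detection half of the pattern depends on the saved file being stable from run to run, so that git only sees a difference when the data really changed. A minimal sketch of that step, using only the standard library. The function name and the sample data are invented for illustration, and the commit itself is left to git.

```python
import json
import tempfile
from pathlib import Path


def save_if_changed(data, path: Path) -> bool:
    """Write data as normalized JSON and report whether the file changed.

    Sorting the keys and fixing the indentation means that the same data
    always produces the same bytes, whatever order the source sent it in.
    """
    text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False  # nothing new, so nothing to commit
    path.write_text(text, encoding="utf-8")
    return True


# Demonstration in a throwaway directory with hypothetical budget figures.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "budget.json"
    print(save_if_changed({"Parks": 1200, "Transit": 5400}, target))  # True
    print(save_if_changed({"Transit": 5400, "Parks": 1200}, target))  # False
    print(save_if_changed({"Parks": 1200, "Transit": 5100}, target))  # True
```

The second call returns False even though the keys arrive in a different order, which is what keeps a daily job from producing a noisy commit history.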

The Cloudflare Problem

There is one ongoing arms race worth being aware of. As scraping tools have become more capable, websites have become more aggressive about blocking them. The most common blocking mechanism is Cloudflare, a content delivery network that sits in front of many government and corporate sites. Cloudflare tries to detect whether the incoming request comes from a human in a normal browser or from a script. If it suspects a script, it presents a challenge that humans can pass but scripts usually cannot.

Shot-scraper, because it uses a real browser engine, passes many of these challenges without extra work. The challenge is testing whether your browser is real. The browser is real, even if there is no human looking at it. Much of the time, the page loads normally.

[calm]

But Cloudflare has been getting more aggressive. Some sites now check whether the browser is being controlled by a script. Some sites check whether the originating internet address is from a known data center. Some sites require a manual click on a button before they will reveal their data. The arms race is ongoing, and there is no single solution.

Willison has documented one approach that works for some cases. He runs the shot-scraper requests through a Tailscale exit node, which routes the traffic through a residential internet connection rather than a data center. This makes the requests look like they are coming from a home user rather than from an automated system. It is a small piece of plumbing but it has saved several specific scraping projects from being blocked entirely.

The larger point is that scraping has become a craft with its own bag of tricks. Shot-scraper handles most situations well, but the edge cases require thinking about how the target site is detecting automation, and finding workarounds that respect the site's intent while still serving the journalism. This is not a fully automated discipline. It is a craft that combines tools and judgment, the same way most journalism does.

What This Has To Do With Working Journalists

For a working journalist in twenty-twenty-six, the practical use of shot-scraper is that the modern web is finally tractable for scraping again. The years when JavaScript-heavy sites were out of reach for journalism are largely over. The tools have caught up. The main barrier is now choosing what to scrape, not figuring out how to scrape it.

This matters because the most interesting data sources on the modern web are exactly the ones that are JavaScript-heavy. Government mapping tools. Interactive dashboards. Real-time financial feeds. Search interfaces over large databases. These are the places where the data is most valuable and where traditional scraping fails. Shot-scraper can reach most of them.

For ongoing work, the pattern is the same as with git-scraping. Pick a specific source that matters to your beat. Write a small shot-scraper command that fetches the relevant data. Schedule it to run daily. Commit the results to a repository. Watch the diff feed for interesting changes. The journalism happens around the diffs.
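What a diff looks like can be sketched without git at all. This is a minimal illustration using Python's difflib, with two invented snapshots standing in for yesterday's and today's capture. In the real workflow git produces the diff, and this only shows the kind of output a reporter would be reading.

```python
import difflib
import json


def snapshot_diff(old, new, name: str = "data.json") -> str:
    """Return a unified diff between two snapshots of the same data."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=f"a/{name}", tofile=f"b/{name}",
        lineterm="",
    ))


# Hypothetical snapshots of a budget table on two consecutive days.
yesterday = {"Parks": 1200, "Transit": 5400}
today = {"Parks": 1200, "Transit": 5100}

print(snapshot_diff(yesterday, today))
```

The printed diff marks the old Transit figure with a minus sign and the new one with a plus sign. A quiet change of that kind, on a page nobody announced an update to, is where the reporting starts.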

The Larger Pattern

There is something philosophically interesting about what shot-scraper represents. The original promise of the web was that pages were documents, addressable by URLs, parseable by any tool. The modern web has drifted away from this. Pages are now applications. The data is hidden behind layers of dynamic loading. The original promise has been technically broken, even if it has been preserved in spirit.

[serious]

Shot-scraper, and the headless browser approach generally, is the restoration. By using a real browser to fetch the page, the original promise is recovered. The data is accessible again. The web is parseable again. The tool is more complex than the simple HTTP fetchers of the early years, but it does the same job. The document is restored to a document, by running it through a browser first.

This is a pattern worth noticing. Sometimes a tool that looks like a workaround is actually a restoration. The simple-fetch model used to work. Then the web complicated itself. The headless browser model handles the complication and gives you back a simple-fetch result. The complexity is hidden inside the tool. The user gets back the original simplicity.

For a journalist trying to do investigative work on data that lives on modern websites, this restoration is what makes the work possible. The page that looked unscrapeable yesterday is scrapeable today, with one command. The data is there. The web has not actually moved beyond the journalist's reach. The tools have just had to catch up. They have caught up. The work continues.