ArchiveBox: A Personal Defense Against Link Rot

The Day The Link Stops Working

In two thousand fourteen, Jonathan Zittrain of Harvard Law School, together with Kendra Albert and Lawrence Lessig, published a paper on link rot. The phenomenon is simple to describe. Every link on the web has some chance of eventually breaking. The web page it pointed to might be deleted. The website might shut down. The company might be acquired and reorganized. The article might be moved to a new address without a redirect. Whatever the cause, the link goes dead. The reader who clicks on it sees an error page.

The numbers in Zittrain's paper were striking. The study looked at citations in the Harvard Law Review and similar academic journals, and in opinions of the United States Supreme Court. These are supposed to be permanent records of legal scholarship and law. They are supposed to be citable for decades or centuries. The links inside them were, in many cases, not lasting more than a few years. More than seventy percent of the URLs cited in the journals, and half of the URLs cited in the Supreme Court opinions, no longer led to the material originally cited. The legal record was, in a quiet and ongoing way, deleting itself.

[serious]

The problem affects journalism at least as much as legal scholarship. A news article published today contains links to sources, to press releases, to corporate filings, to background coverage from other publications. Some of these links will be broken within a year. Many more will be broken within a decade. The reader who wants to verify the article's claims a few years after publication will find themselves clicking on dead URLs. The article remains. The evidence behind the article quietly disappears.

This is the problem that ArchiveBox tries to solve, in a specific and personal way. It is not the only tool that addresses link rot. The Internet Archive's Wayback Machine has been working on the problem for decades, preserving copies of websites for posterity. The Save Page Now feature lets anyone archive a specific page to the Wayback Machine. These services are essential. They are also not under your control. They depend on a non-profit organization that has its own funding crises, its own legal threats, its own technical limitations. For journalism that needs to be sure its sources will still exist in ten years, depending on a single external organization is risky.

ArchiveBox is the response. It is a self-hosted web archiving system. You run it on your own computer or your own server. You feed it URLs. It captures each URL in several formats. The captures are stored locally, in your control, indexed and searchable. The system has been quietly accumulating users since its first release in two thousand seventeen, and it has become a small but important piece of the infrastructure of personal archiving for journalists and researchers around the world.

What A Capture Actually Contains

To understand what ArchiveBox does, you need to understand what it means to fully capture a web page. A page is not a single file. It is a collection of files. The main file is the HTML, which contains the structure and text. Then there are images, stylesheets, scripts, fonts, video clips, and dynamic data that the page fetches as it loads. A complete capture has to grab all of these and store them together, in a way that can be reassembled later.

ArchiveBox does this aggressively. When you give it a URL, it does not just save one version of the page. It saves several versions, side by side. It saves the raw HTML. It saves a fully rendered version after JavaScript has run. It saves a screenshot of the page. It saves a PDF of the page. It saves a WARC file, the standard web archive format, which records the requests and responses made during the capture. It saves a text-only extraction for full-text search. It can also submit the URL to the Internet Archive's Wayback Machine, so that a copy exists elsewhere in case the local version is ever lost. Each of these is a different way of preserving the page, and each protects against different failure modes.

The screenshot is the simplest version. It is just an image of what the page looked like at the moment of capture. The image preserves the visual evidence of what was published. If the article you cited gets edited later to soften a quote or remove a paragraph, the screenshot shows what was originally there. The visual capture is the simplest defense against editorial revision.
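The same check can be made on text rather than pixels. What follows is a minimal sketch, not part of ArchiveBox, that compares the extracted text of two captures of the same page and reports which lines were removed or added. The function name changed_lines is hypothetical, and the sketch assumes you already have the text of both captures as strings.

```python
import difflib
from itertools import islice


def changed_lines(earlier: str, later: str) -> list[str]:
    """Return the lines removed ("-") or added ("+") between two captures."""
    diff = difflib.unified_diff(
        earlier.splitlines(),
        later.splitlines(),
        fromfile="earlier capture",
        tofile="later capture",
        lineterm="",
        n=0,  # no surrounding context, only the changes themselves
    )
    # The first two lines of a unified diff are the file headers; skip them.
    body = islice(diff, 2, None)
    # Drop the "@@" hunk markers so only changed lines remain.
    return [line for line in body if not line.startswith("@@")]


if __name__ == "__main__":
    before = "The minister called the plan a disaster.\nNo further comment."
    after = "The minister said the plan faced challenges.\nNo further comment."
    for line in changed_lines(before, after):
        print(line)
```

A softened quote shows up as one removed line and one added line. Two identical captures produce an empty list.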

The PDF is a slightly more useful version. It contains the layout of the page, but also the text underneath, which means it is searchable. With a PDF-aware search tool such as pdfgrep, you can search a folder of archived PDFs and find every page that mentions a specific term. The PDF captures more than the screenshot but takes more space and is sometimes less faithful to the original.

The fully rendered HTML is the most flexible version. It contains all the original elements, all the styling, all the scripts. With the right tools, it can be opened in a browser later and will look essentially identical to the original. This is the most useful version for serious archival work, but it is also the largest and most fragile. If any of the supporting resources are missing, the rendering breaks down.

ArchiveBox saves all of these by default. The argument is that storage is cheap, and the cost of saving multiple formats is much smaller than the cost of having archived a page in only one format that later turns out to be the wrong one.
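To make the idea of one page stored in several forms concrete, here is a small sketch using only the Python standard library. It is an illustration of the principle, not ArchiveBox's implementation. The function name save_capture and the file names raw.html, text.txt, and meta.json are assumptions chosen for the example. It takes HTML you have already fetched and writes the raw markup, a text-only extraction, and a metadata record with a checksum.

```python
import hashlib
import json
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def save_capture(url: str, html: str, out_dir: Path) -> Path:
    """Write three views of one page into out_dir (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. The raw HTML, exactly as received.
    (out_dir / "raw.html").write_text(html, encoding="utf-8")

    # 2. A text-only extraction, useful for search and for diffing.
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    (out_dir / "text.txt").write_text("\n".join(extractor.chunks), encoding="utf-8")

    # 3. Metadata: where it came from, when, and a checksum of the raw HTML.
    meta = {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    }
    (out_dir / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return out_dir
```

The checksum is there so that anyone can later confirm the stored HTML has not been altered since the capture. A real archiver adds rendered output, screenshots, and PDFs, which need a headless browser and are beyond a standard-library sketch.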

The Indexing And Search

The other thing ArchiveBox does that simple page saving does not is index everything for search. The text of each captured page is extracted. The metadata is recorded: the title, the original URL, the time of capture. The result is a personal search engine over everything you have ever archived.

This is more useful than it might sound. When you write a journalism piece, you cite sources. The sources might number in the dozens for a single article. Over years of writing, you accumulate thousands of sources. The sources contain information that you might want to refer back to. A name. A date. A specific phrase. A statistical claim. A previous quote from a public figure.

[calm]

Without an archive, finding this information again means re-reading source articles or re-searching the web, with all the link rot problems that implies. With an archive, the information is in your local search index. You type a name. You get back every page you have ever archived that mentions that name, with the context, the date you archived it, and the original source URL. The archive becomes a personal research database that grows with every story you write.
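The mechanism behind "you type a name, you get back every page" is an inverted index: a map from each word to the pages that contain it. The sketch below shows the idea in a few lines. It is not how ArchiveBox implements search. The names build_index and search are hypothetical, and the example assumes the page text has already been extracted.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Start from the pages matching the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    pages = {
        "https://example.com/a": "Mayor Jane Doe announced the budget",
        "https://example.com/b": "The budget was rejected",
    }
    index = build_index(pages)
    print(search(index, "Jane Doe"))
```

A production search engine adds ranking, phrase matching, and storage on disk, but the core lookup is this one.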

For long-running investigative work, this compounds. A reporter who has been covering a specific beat for ten years, archiving every relevant page, has a personal database that is genuinely unique. Nobody else has the same combination of sources, the same context, the same accumulated knowledge. The archive is a competitive advantage. It is also a defense against the slow erasure of the web, where the sources that supported your previous articles are quietly disappearing while you are working on new ones.

The Self-Hosting Argument

The thing that distinguishes ArchiveBox from cloud-based archival services is that it runs on your own infrastructure. The captures are stored on your hard drive or your server. The search index is local. The data does not leave your control.

There are several reasons this matters. The first is durability. Cloud services can shut down or change pricing or change terms. If your archive is in a cloud service that disappears, your archive disappears with it. A local archive stays yours for as long as you keep it backed up, regardless of what happens to any external company.

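The second is privacy. Cloud archival services know what you are archiving. For most journalism, this does not matter much. For sensitive investigative work, where the choice of what to archive might itself be a clue, this matters. A self-hosted archive reveals nothing to an archival service. The site being captured still sees a request from your machine, and for sensitive work the optional submission to the Wayback Machine should be switched off. With that done, the reporter's research stays private.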

The third is comprehensiveness. Cloud services have rate limits, storage limits, and rules about what they will and will not archive. A self-hosted archive is limited only by your storage. You can archive entire websites, hours of video, large data dumps. The cost is your disk space. The benefit is that nothing important is left out.

The cost of self-hosting is that you have to do the hosting. You have to install the software, run it, back up the data, handle updates, deal with the occasional technical problem. For a journalist who is not technically inclined, this can be a real barrier. ArchiveBox has been working to make this easier over time, but the tool is still meaningfully harder to use than a click-to-save cloud service.

The Composition With Journalism Workflows

The way ArchiveBox fits into a working journalism practice is straightforward. You set up ArchiveBox once, either on your laptop or on a small server. You make it the place where every source you encounter gets sent. When you encounter a source while researching, you archive it. The archive runs in the background. Within seconds for a simple page, or a few minutes for a heavy one, the page is captured in your archive. You continue researching.
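The archiving step can be scripted. The sketch below wraps the archivebox add command in Python. It assumes ArchiveBox is installed and on your PATH, and that data_dir is a folder where archivebox init has already been run. The helper names build_add_command and archive_source are hypothetical, and the depth option reflects documented usage that may differ between versions.

```python
import subprocess
from pathlib import Path


def build_add_command(url: str, depth: int = 0) -> list[str]:
    """Build the command line for adding one URL to the archive.

    depth=0 captures only the page itself; depth=1 also follows the
    links found on it (an assumption based on documented CLI usage).
    """
    return ["archivebox", "add", f"--depth={depth}", url]


def archive_source(url: str, data_dir: Path, depth: int = 0) -> None:
    """Run the capture inside the ArchiveBox data directory.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    subprocess.run(build_add_command(url, depth), cwd=data_dir, check=True)


if __name__ == "__main__":
    # Printing the command is safe; call archive_source() to actually run it.
    print(" ".join(build_add_command("https://example.com/report")))
```

Passing the command as a list rather than a single string avoids shell quoting problems with URLs that contain ampersands or question marks.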

When you write the article, you cite the original URL. You also keep the archived version in your records. If the original URL goes dead, you have the local copy. If a reader challenges you on what the source said, you can show them the local copy. The journalism is defended by the archive.

For sources that are particularly important, you can mirror the local copy to a public location. ArchiveBox supports exporting the captures as a static website that anyone can browse. The reporter can choose to publish a copy of each key source alongside the article, so that readers do not have to take anything on faith. This is the strongest version of the show-your-work pattern. The source is right there, in the reporter's archive, with the article.
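As an illustration of publishing sources alongside an article, the sketch below renders a plain HTML page listing each source with a link to its archived copy. This is not ArchiveBox's own export. The function name render_source_index and the tuple layout of original URL, path to the archived copy, and capture date are assumptions made for the example.

```python
from html import escape


def render_source_index(title: str, sources: list[tuple[str, str, str]]) -> str:
    """Render a static HTML list of (original_url, archived_path, date) entries."""
    items = "\n".join(
        '<li><a href="{path}">{url}</a> (captured {date})</li>'.format(
            path=escape(path, quote=True),
            url=escape(url),
            date=escape(date),
        )
        for url, path, date in sources
    )
    safe_title = escape(title)
    return (
        "<!doctype html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    page = render_source_index(
        "Sources for the budget story",
        [("https://example.com/report", "archive/1/index.html", "2024-01-05")],
    )
    print(page)
```

Every value is escaped before it goes into the page, so a source URL containing an ampersand or angle bracket cannot break the markup.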

The pattern composes with the rest of the open data journalism stack. The archives can be referenced from notebooks. The screenshots can be embedded in articles. The search index can be queried from analysis scripts. The whole stack becomes more coherent when there is a personal archive sitting underneath, capturing everything as it happens.

The Larger Argument

There is a philosophical argument that ArchiveBox embodies, which is worth saying explicitly. The argument is that journalism has a long-term responsibility to the sources it cites. The article you publish today is not just a thing to be read this week. It is a thing that will be read, occasionally, by readers and researchers for decades to come. Some of those readers will want to check your sources. The sources need to still exist for the readers to check them.

[serious]

The current state of web journalism is not consistent with this responsibility. Most articles cite URLs without archiving them. The URLs go dead. The article becomes uncheckable. The journalism becomes weaker over time, not because the journalism was wrong, but because the evidence that supported it has been allowed to disappear. The articles in the New York Times' archives from two thousand to two thousand ten are full of dead links to sources that no longer exist. The articles are not less true than they were. They are less defensible.

The personal archiving practice that ArchiveBox enables is a way of taking responsibility for this. The reporter archives sources as they use them. The archive lives on the reporter's own infrastructure. The sources persist as long as the reporter does, and longer if the archive is preserved or transferred. The journalism gets stronger over time, not weaker, because the evidence accumulates rather than evaporates.

This is the practice that distinguishes a working reporter from a hobbyist. The working reporter keeps records. The records become the foundation of future stories. The future stories cite back to the records, building a coherent body of work that defends itself against challenge. The hobbyist publishes articles into the void and forgets them. The reporter publishes articles into a continuing project that will still be coherent in twenty years.

What This Has To Do With Working Journalists

The practical move, for a journalist who has not yet adopted a personal archiving practice, is to start small. Install ArchiveBox. Archive the next five sources you cite. See how it feels. The setup cost is real but small. The ongoing cost is mostly disk space, backups, and the occasional update. The benefit accumulates.

The deeper move is to think about archiving as part of the journalism itself, not as an optional add-on. The reporter who archives sources is doing journalism. The reporter who does not archive sources is doing journalism with a self-destructing component. The articles are the same in the moment. The articles are different in five years. The difference is the archive.

[calm]

For a working reporter in a small newsroom, where there is no archival department, no librarian, no institutional memory, the personal archive is the only memory the publication has. The articles in the back catalog point to sources that may already be gone. The articles being written today should not have the same problem. ArchiveBox, or any equivalent personal archive, is the response. It is small infrastructure that pays back enormously over time, by making the journalism more durable, more checkable, and more honest. The Wayback Machine cannot do everything. The reporter has to do some of it. The tools exist. The investment is modest. The work compounds.