Before two thousand nine, when an American newspaper wrote an article based on a government document, the standard practice for sharing that document with readers was about the worst one possible. The newspaper would convert the document to a PDF, sometimes scanned, sometimes a mix of text and images, often hundreds of pages long. It would upload the file to its own server. It would put a small link at the bottom of the article saying, read the full document here. The reader, if they were unusually motivated, would click the link, wait for a many-megabyte file to download, open it in their PDF reader, and try to find the specific passage the article had referenced.
[serious]
In practice, almost no reader did any of this. The link existed for institutional legitimacy. The article said, here is our source. The newspaper's lawyers were satisfied. The transparency requirement was technically met. Nothing else happened. The document sat on the server, unread, until the link rotted and the file disappeared entirely when the website got redesigned three years later.
This was the state of source publishing in journalism. It was a ritual gesture rather than a working practice. And then, in the spring of two thousand eight, a team at the New York Times, working on a massive document dump from the United States presidential election, built a tool to handle their immediate problem. They called it Document Viewer. It let them embed pages of the dump directly inside their articles, with annotations, with search, with the ability for readers to click on a specific page rather than downloading the entire file.
A few months later, two editors from ProPublica, Scott Klein and Eric Umansky, saw what the New York Times team had built. They asked if it could be shared. The conversation that followed produced DocumentCloud, which has now been operational for over fifteen years and has fundamentally changed how investigative journalism handles its sources.
The technical problem DocumentCloud addresses is small to describe and large in consequence. When you have a long document and you want to point a reader at a specific passage, you need three things. You need the document hosted somewhere stable. You need a way to point at a specific page. And you need a way to annotate that page so the reader knows what they are looking at without having to read the surrounding hundred pages.
DocumentCloud does all three. You upload a document. The service runs optical character recognition on it, which means even scanned documents become searchable. The service hosts the document on its own infrastructure, which means link rot does not happen the way it used to. The service generates a stable URL for every page of every document. The service lets you draw annotations on specific passages, with your own commentary, which appear as overlays for any reader visiting the page.
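Those three pieces, a stable home, a pointer to a page, and a note attached to a passage, can be sketched as a small data model. This is only an illustration of the idea, not DocumentCloud's actual schema or URL format. The class names, the field names, and the example.org address are all assumptions made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A highlighted rectangle on one page, with the reporter's note."""
    page: int    # 1-based page number
    x1: float    # rectangle corners as fractions of the page (0.0 to 1.0)
    y1: float
    x2: float
    y2: float
    note: str


@dataclass
class HostedDocument:
    """A hosted document. base_url is a hypothetical stable address."""
    doc_id: str
    base_url: str
    page_count: int
    annotations: list[Annotation] = field(default_factory=list)

    def page_url(self, page: int) -> str:
        """Return a per-page address (the URL layout is an assumption)."""
        if not 1 <= page <= self.page_count:
            raise ValueError(f"page {page} is outside 1..{self.page_count}")
        return f"{self.base_url}/{self.doc_id}/pages/{page}"

    def annotate(self, annotation: Annotation) -> str:
        """Store the annotation and return the URL of the page it sits on."""
        url = self.page_url(annotation.page)
        self.annotations.append(annotation)
        return url


memo = HostedDocument("memo-1999", "https://example.org/documents", 120)
link = memo.annotate(
    Annotation(page=47, x1=0.1, y1=0.4, x2=0.9, y2=0.5,
               note="The chairman's instruction to the board.")
)
print(link)
```

The design choice worth noticing is that the annotation stores a page number and a rectangle, not a copy of the text. The article points at the source. It does not replace it.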
The result is that an article can say, here is what the chairman wrote in his nineteen-ninety-nine memo, and embed the actual page of the memo, with the relevant passage highlighted, right inside the article. The reader sees the source. The reader sees the highlight. The reader does not have to download anything. The transparency is not a ritual gesture. It is a working part of the article.
This sounds like a small change. It is not. It is the difference between an article that asserts something and an article that demonstrates something. The reader's experience shifts. The accountability of the journalism shifts. The libel risk to the publisher shifts. The credibility of the work shifts. All of this happens because one technical problem, the problem of pointing at a specific passage of a specific document, got solved well.
The optical character recognition piece is worth pausing on, because it is the part that does the most invisible work. Optical character recognition, often shortened to OCR, is the technology that converts an image of text into actual searchable text. It has been around for decades. It used to be slow and unreliable. Modern OCR is fast and very reliable for most documents.
DocumentCloud runs OCR on every uploaded document by default. This means that even a scanned PDF of a fax of a typewriter-produced government report from nineteen-seventy-three becomes searchable. You can type a word into the search box and find every page of every document where that word appears. The OCR is imperfect on old or damaged documents. Some letters are misread. Some words are missed. But for most documents, the search works well enough to be the primary way reporters discover what is inside their own document collections.
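Once every page has OCR text, finding every page where a word appears is a classic inverted index problem. The sketch below assumes the OCR step has already run and handed back plain text per page. It is not how DocumentCloud implements search. The function names and the sample pages are invented for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each word to the set of (doc_id, page_number) where it occurs.

    pages: dict of (doc_id, page_number) -> OCR text for that page.
    """
    index = defaultdict(set)
    for location, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(location)
    return index


def search(index, query):
    """Return the sorted locations that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical OCR output for three pages in two documents.
pages = {
    ("report-1973", 1): "Memo to the Chairman regarding timber",
    ("report-1973", 2): "Budget tables",
    ("fax-12", 4): "The chairman approved the timber permits",
}
index = build_index(pages)
print(search(index, "chairman timber"))
```

An exact-match index like this one is also where OCR mistakes hurt. A misread letter means the word is filed under the wrong key, and the page silently drops out of the results.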
[calm]
This changes the nature of investigative work. Before searchable OCR, a reporter with a thousand-page document dump had to read the whole thing. Or skim it. Or hope that an index existed. Now the reporter can type a name, a date, a phrase, and immediately find every relevant passage across every document. The work that used to take weeks of reading can be done in hours of querying.
DocumentCloud is not the only tool that does this. The technology is available in many forms now. But DocumentCloud was one of the earliest places where it was made available to journalists for free, at scale, with a workflow designed for the way reporters actually use documents. That early access mattered. It shaped a generation of investigative reporters who now expect their documents to be searchable as a baseline.
There is one specific feature of DocumentCloud that deserves its own segment, because it reveals something important about how documents work as digital objects. The feature is called Bad Redactions.
When a government agency releases a document with sensitive information, they redact the sensitive parts. In the analog world, this meant taking a black marker and physically obscuring the text. In the digital world, the redaction is usually done by drawing a black rectangle over the offending passage in a PDF editor. The black rectangle is added as a visual element on top of the original text. The original text still exists in the document's text layer. The black rectangle is sitting on top of it like a sticker.
This is fine if you only look at the document visually. The black rectangle covers the text. You see only the rectangle. But if you copy and paste the text underneath the rectangle, or if a program reads the document's text layer directly, the supposedly redacted information is still there, perfectly readable. The rectangle was decorative, not destructive.
This is a remarkably common mistake. Government agencies have been making it for years, all over the world. Sensitive information has leaked through bad redactions in court filings, in police reports, in intelligence documents, in corporate disclosures. The leak happens because the people doing the redactions do not understand that drawing a rectangle over text is not the same as deleting the text.
DocumentCloud has a feature, an add-on called Bad Redactions, that automatically scans uploaded documents for this pattern. It identifies black rectangles. It checks whether searchable text exists underneath them. If so, it extracts the text and surfaces it. The reporter sees, at the top of the document, a list of every passage that the agency thought it had redacted but had not actually redacted. The feature has produced multiple notable stories. One investigation in Brazil, for example, used Bad Redactions to identify small companies that had been fined for smuggling endangered Brazilian timber, where the names had been technically obscured but were perfectly readable in the underlying text.
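The core of that check is plain geometry. Here is a minimal sketch, assuming a PDF library has already given you two lists for a page: the words in the text layer with their bounding boxes, and the filled black rectangles. It is not the add-on's real code. The names, the coordinates, and the eighty percent coverage threshold are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(box):
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def overlap_area(a, b):
    """Area shared by two boxes, or 0.0 if they do not intersect."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def find_bad_redactions(words, rectangles, min_covered=0.8):
    """Return text-layer words that sit mostly underneath a rectangle.

    words: list of (text, Box) taken from the page's text layer.
    rectangles: list of Box for the filled black shapes on the page.
    min_covered: fraction of a word's box that must be hidden (assumed).
    """
    exposed = []
    for text, box in words:
        word_area = area(box)
        if word_area == 0:
            continue
        if any(overlap_area(box, r) / word_area >= min_covered
               for r in rectangles):
            exposed.append(text)
    return exposed


# Hypothetical page: one line of text, with a black bar over the name.
words = [
    ("Paid", Box(10, 10, 40, 20)),
    ("to", Box(45, 10, 55, 20)),
    ("Acme", Box(60, 10, 95, 20)),
    ("Timber", Box(100, 10, 140, 20)),
]
black_bars = [Box(58, 8, 142, 22)]
print(find_bad_redactions(words, black_bars))  # ['Acme', 'Timber']
```

If the list comes back non-empty, the redaction was a sticker. The words under the bar were never removed from the file.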
The same feature can also be used in reverse. If a reporter wants to actually redact something before publishing a document, the feature offers to permanently remove the text from the document, not just visually obscure it. The reporter clicks a button. The text is gone. There is no underlying layer to leak through.
This is what good investigative tooling looks like. It addresses a specific real problem that real journalists have. It uses the technical structure of the documents to surface what was already there. It works in both directions, finding bad redactions and helping make good ones. Most of all, it does the work for the reporter, so the reporter can do journalism instead of becoming a PDF security expert.
DocumentCloud is now operated by MuckRock, a non-profit organization that started as a service for filing Freedom of Information Act requests in the United States. The connection makes sense. MuckRock helps reporters get documents from the government. DocumentCloud helps reporters work with those documents once they have them. The combination is one of the more useful pairings in the contemporary American investigative ecosystem.
MuckRock took over DocumentCloud from a previous home that was, for a while, inside Investigative Reporters and Editors, the non-profit association for investigative journalists. The transitions have not always been smooth. The service has had funding crises. It has changed pricing models. It has occasionally been threatened with shutdown. The fragility is worth knowing about. DocumentCloud is mission-critical infrastructure for investigative journalism, but it is run by a small non-profit with a small budget. If MuckRock failed, the alternative would not be obvious.
This is the situation with much of the open journalism infrastructure. Small organizations, small budgets, mission-critical work. The contrast with the commercial sector is stark. Banks have their compliance tools. Newsrooms have DocumentCloud and a non-profit. The work gets done because people care. The work is fragile because people are not paid much to care.
If you are running a small newspaper in Sweden, you do not need to run a DocumentCloud instance yourself. You might not even need an account, depending on your publishing model. But the practice that DocumentCloud has institutionalized is worth adopting in whatever form fits your newsroom.
[serious]
When you write an article that references a public document, link to the actual document. Not the press release about the document. Not a summary of the document. The actual document, hosted somewhere stable. If you can highlight the specific passage you are citing, even better. If the document is in Swedish and your audience is Swedish, the value is the same.
This is more than transparency theater. It is a practical way to build trust with readers and to defend yourself against accusations of misrepresentation. The reader who clicks through and reads the source is a reader who comes back. The reader who tries to fact-check you and finds that your sources are exactly what you said they are is a reader who has just decided you are credible. Over a long time, this practice compounds. Your articles become more durable. Your reputation becomes more solid. Your work gets referenced by other journalists because they can confirm what you said.
The infrastructure for this can be very simple. Host the documents on your own site. Link directly to them. Add captions. Annotate when useful. If your content management system supports embedding PDFs, embed them. If it does not, link clearly. The technical part is not what matters. The discipline of always linking to the source, every time, is what matters.
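For the simplest version of this, a self-hosted PDF and a link, a few lines are enough. The sketch below builds an HTML link that points at one page of a PDF using the page fragment that most browser PDF viewers honor, though not every viewer does. The function name, the example address, and the caption are assumptions for illustration.

```python
from html import escape


def source_link(pdf_url, page, caption):
    """Build an HTML link that opens a hosted PDF at a specific page.

    Relies on the '#page=N' fragment, which most browser PDF viewers
    support. Viewers that ignore it simply open the first page.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    href = f"{pdf_url}#page={page}"
    return f'<a href="{escape(href)}">{escape(caption)} (page {page})</a>'


# Hypothetical document on the newspaper's own server.
print(source_link(
    "https://example.org/docs/budget-2024.pdf",
    12,
    "Municipal budget & appendix",
))
```

Printing the page number in the visible link text is deliberate. If a reader's viewer ignores the fragment, they still know where to look.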
DocumentCloud represents a specific argument about what journalism should be. The argument is that journalism is most credible when it shows its work. The reader should be able to verify the sources. The reader should be able to read the underlying documents. The reader should not have to take the journalist's word for anything that can be demonstrated.
This is a stronger position than the journalism profession has historically held. The traditional model was that the journalist filtered the world for the reader. The reader trusted the journalist's filtering. The sources were available to the journalist privately. The reader saw only the finished article.
The DocumentCloud model is that the journalist still filters the world, but the filtering is now transparent. The reader can see what was filtered out. The reader can verify that the filtering was honest. The reader can disagree with the filtering and arrive at a different interpretation. The journalism remains valuable because the filtering is still labor and still skill, but the journalism is no longer a closed system.
For a working reporter in twenty-twenty-six, this is the higher standard. Transparency by default. Sources linked, not described. Documents annotated, not summarized. Quotes pulled directly, not paraphrased. The technical tools have made this possible. The cultural shift is still slow but moving. The reporters who adopt the practice early are the ones whose work survives the longest. That is the takeaway. Show your work. Always link to the source. Let the reader see what you saw. The article becomes stronger. The journalism becomes more honest. The whole structure benefits, slowly, one annotated document at a time.