Datashare: The Tool That Followed the Money

A Message From An Anonymous Source

On a night in twenty-fifteen, a reporter at the German newspaper Süddeutsche Zeitung named Bastian Obermayer received a message from an anonymous source. The source said, simply, hello, this is John Doe. Interested in data? The reporter said yes. The data started arriving. Over the following months, the source delivered approximately two-and-a-half terabytes of internal documents from a Panamanian law firm called Mossack Fonseca. The documents contained the names of clients, the structures of shell companies, the flows of money, and the identities of people who used offshore arrangements to hide their wealth.

The Süddeutsche Zeitung could not investigate this alone. The data was in too many languages. The connections were in too many countries. The companies were registered in jurisdictions the German reporters did not know. Obermayer reached out to the International Consortium of Investigative Journalists, an organization headquartered in Washington that specializes in coordinating cross-border investigative collaborations. The consortium took the data, brought in reporters from dozens of countries, and organized what became the Panama Papers, at the time the largest journalism collaboration in history.

[serious]

Working on the Panama Papers was not like working on a normal story. There were hundreds of reporters involved. The data set was enormous, much of it scanned PDFs in multiple languages. The reporters could not all be in one place. They needed to share what they found. They needed to be able to search the data from their own offices. They needed the work to be secure, because the source had reasons to fear retaliation and the reporters had reasons to fear leaks. The infrastructure for this kind of collaboration did not exist in any usable form when the project began.

The consortium built it. They used a commercial product called Linkurious for the network analysis. They used a search platform built on the open-source Apache Solr engine for centralized searching. They built additional tools to handle specific parts of the workflow. The Panama Papers was published in April of twenty-sixteen. It changed careers. It changed governments. It also produced, as a side effect, a piece of software that the consortium has been refining ever since. They call it Datashare.

The Solo Reporter Problem

Datashare exists because the consortium realized, after the Panama Papers, that they had built a workflow for hundreds of reporters but had not built a workflow for one reporter. The infrastructure they had assembled required a team to run. The server needed someone to maintain it. The training to use the tools required time investments that only paid off at scale. A small newsroom or a freelance investigator could not use what the consortium had built. The tools were too big for them.

The consortium decided this was a problem worth solving. Most investigative work, they realized, was not done by international collaborations. Most investigative work was done by one or two reporters at a regional newspaper, working on a story that mattered locally but did not justify a global team. These reporters often received documents. They sometimes received leaks. They had to do roughly the same kind of work that the Panama Papers team had done, but at a smaller scale, with no infrastructure budget. The consortium had institutional knowledge about how to do this work. The institutional knowledge needed to be transferred into software that any reporter could use.

Datashare was the response. It is a self-hosted document analysis platform designed for a single user or a very small team. You install it on your own laptop or your own modest server. You drop documents into it. The system processes them, extracts entities, makes them searchable, lets you tag and annotate and connect them. The workflow is essentially the same as what the Panama Papers team did, but compressed to fit on a personal machine.

The license is the GNU Affero General Public License, the same family of license that DocumentCloud uses. It is open source. It is free. It comes with no warranty and no commercial support, but it works, and the consortium continues to develop it because they want the tool to exist for the journalists who need it.

How It Processes A Document Collection

When you give Datashare a collection of documents, the first thing it does is parse them. The documents might be in dozens of different formats. PDFs of scanned papers. Word documents. Excel spreadsheets. Email mailboxes. Plain text files. Image files containing text. The system handles all of these. It extracts the text from each. It runs optical character recognition on the scanned images. It parses the structured data from the spreadsheets. The output is a collection of plain-text representations of every document, with the original files still accessible for verification.
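The first decision in that pipeline, which kind of processing each file needs, can be sketched in a few lines. This is an illustration only, not Datashare's own code. The step names and the extension table are hypothetical, and a real system would inspect file contents rather than trust the extension.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the processing step it needs.
STEP_BY_SUFFIX = {
    ".txt": "read_text",
    ".docx": "parse_office",
    ".xlsx": "parse_spreadsheet",
    ".eml": "parse_email",
    ".mbox": "parse_email",
    ".pdf": "extract_text_then_ocr_if_empty",
    ".png": "ocr",
    ".jpg": "ocr",
    ".tiff": "ocr",
}


def plan_processing(paths):
    """Group file paths by the processing step each one needs."""
    plan = {}
    for path in paths:
        suffix = Path(path).suffix.lower()
        # Anything unrecognized is set aside for a human to look at.
        step = STEP_BY_SUFFIX.get(suffix, "needs_manual_review")
        plan.setdefault(step, []).append(str(path))
    return plan
```

The point of the grouping is that the slow steps, such as character recognition, can be scheduled separately from the fast ones.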

The next step is entity extraction. The system runs natural language processing across every document, looking for named entities. People. Companies. Places. Dates. Email addresses. Phone numbers. Money amounts. Identification numbers. Each entity is tagged with its type and its location in the source document. The same entity might appear in many documents, and the system tracks all the appearances.
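To make the idea of a typed, located entity concrete, here is a minimal sketch. It is not how Datashare does it. Real systems use trained language models, while this toy uses two regular expressions, and the names Mention and extract_mentions are invented for the example.

```python
import re
from dataclasses import dataclass

# Toy patterns for two entity types. These are assumptions for illustration;
# names of people and companies need a trained model, not a pattern.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "MONEY": re.compile(r"(?:USD|EUR|\$|€)\s?\d[\d,]*(?:\.\d+)?"),
}


@dataclass(frozen=True)
class Mention:
    doc_id: str
    entity_type: str
    text: str
    start: int  # character offset of the mention in the document text


def extract_mentions(doc_id, text):
    """Return every pattern match in the text, ordered by position."""
    mentions = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append(
                Mention(doc_id, entity_type, match.group(), match.start())
            )
    return sorted(mentions, key=lambda m: m.start)
```

Each mention carries its document, its type, and its offset, which is what lets the reporter jump from an entity straight back to the passage it came from.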

The entities become the navigation structure of the collection. You can browse the documents by document, but you can also browse them by entity. Click on a person's name and see every document where that person appears, in chronological order, with the surrounding context. The collection is no longer a folder of files. It is a network of entities linked by their co-occurrence in documents.
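The shift from folder to network can be sketched with two small functions. This is a simplified illustration that assumes the entities have already been extracted for each document. The function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_index(doc_entities):
    """Map each entity to the sorted list of documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc_id)
    return {entity: sorted(docs) for entity, docs in index.items()}


def co_occurrences(doc_entities):
    """Count how many documents each pair of entities shares."""
    counts = defaultdict(int)
    for entities in doc_entities.values():
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Invented sample data: document name -> entities found in it.
docs = {
    "memo.pdf": ["Alice Example", "Acme Holdings"],
    "invoice.pdf": ["Acme Holdings", "Bob Sample"],
    "email.eml": ["Alice Example", "Acme Holdings"],
}
index = build_entity_index(docs)
links = co_occurrences(docs)
```

The index answers the question of where a name appears. The pair counts are the edges of the network: the more documents two entities share, the stronger the link worth checking.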

This is genuinely powerful for investigative work. The reporter does not have to remember which documents contained which references. The system remembers. The reporter can focus on understanding the relationships rather than on bookkeeping. The same data, organized this way, reveals things that would have been invisible in a normal folder of files.

The Multilingual Capability

One specific feature of Datashare worth knowing about is its multilingual support. The system handles over forty languages out of the box. The language of each document is detected automatically. The entity extraction uses the appropriate model for that language. The search results can be filtered by language. Documents in different languages can be analyzed together.
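The detect-then-route idea can be shown with a deliberately crude sketch. Counting common words is a stand-in for real language detection, which relies on statistical models, and the word lists and function names here are assumptions made for the example.

```python
# Tiny, hypothetical stopword lists. A real detector uses far richer models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "sv": {"och", "att", "det", "som", "en", "är"},
    "es": {"el", "la", "de", "que", "y", "los"},
}


def detect_language(text):
    """Guess the language by counting words found in each stopword list."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def group_by_language(docs):
    """Group document ids by detected language, ready for per-language models."""
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(detect_language(text), []).append(doc_id)
    return groups
```

Once the documents are grouped this way, each group can be handed to the extraction model for its language, and the language label doubles as a search filter.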

This matters more than it might seem. International business documents are often multilingual. A corporate filing might be in English. The underlying contracts might be in the language of the country where the company operates. The board minutes might be in a third language. A single investigation can easily span four or five languages. Without multilingual handling, the reporter has to do everything by hand, translating each document before analysis. With multilingual handling, the analysis happens in parallel across all the languages, and the reporter only has to read the specific passages that turn out to matter.

[calm]

For a Swedish reporter investigating an Australian mining company with operations in several countries, this is enormously useful. The documents will arrive in Swedish, English, possibly Spanish or Portuguese depending on the operations. The entities will still be recognized across all of them. The names of the same people and companies will appear in the appropriate spellings. The system stitches together a coherent investigation even when the underlying material is linguistically fragmented.

The Annotation Layer

The other thing Datashare provides, beyond extraction and search, is an annotation layer. You can highlight specific passages of specific documents and add notes. You can tag documents with custom labels. You can group related documents into named bundles. You can star important findings. The annotations live alongside the documents, searchable and shareable.
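One way to picture an annotation layer is as a small notebook that sits beside the documents and never modifies them. The sketch below is an assumption about the shape of such a layer, not Datashare's actual data model. The class and method names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    doc_id: str
    start: int  # character offset where the highlight begins
    end: int    # character offset where the highlight ends
    text: str   # the reporter's comment on the passage


@dataclass
class Notebook:
    tags: dict = field(default_factory=dict)   # doc_id -> set of labels
    starred: set = field(default_factory=set)  # doc_ids marked important
    notes: list = field(default_factory=list)  # highlighted passages

    def tag(self, doc_id, label):
        self.tags.setdefault(doc_id, set()).add(label)

    def star(self, doc_id):
        self.starred.add(doc_id)

    def annotate(self, doc_id, start, end, text):
        self.notes.append(Note(doc_id, start, end, text))

    def docs_with_tag(self, label):
        """Return the documents carrying a label, in a stable order."""
        return sorted(d for d, labels in self.tags.items() if label in labels)

    def search_notes(self, term):
        """Case-insensitive search over the reporter's own comments."""
        term = term.lower()
        return [n for n in self.notes if term in n.text.lower()]
```

Keeping the annotations separate from the source files matters for verification: the originals stay untouched, and the interpretation can be searched, shared, or discarded on its own.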

This is what turns the system from a search tool into a working investigation. The reporter does not just find documents. The reporter builds an interpretation of the documents. The interpretation lives in the annotations. When the reporter comes back to the investigation a month later, the annotations are still there, explaining what the reporter understood at the time. The institutional memory of the investigation is preserved.

For a long-running investigation, this is critical. The reporter who covers a local mining beat for years will accumulate hundreds of documents. The annotations are how the reporter keeps track of what each document means. Without the annotations, the documents are a pile. With the annotations, the documents are an investigation in progress, with a coherent thread that can be followed even after a long pause.

The Comparison To Aleph

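Datashare and Aleph are close relatives in a meaningful way, even though they come from different organizations. Aleph is maintained by the Organized Crime and Corruption Reporting Project rather than by the consortium. Both grew out of cross-border investigative reporting. Both use similar underlying technologies. Both handle multilingual document collections. Both extract entities. Both provide search and annotation. The difference is in scale and intended user.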

Aleph is built for an organization with multiple investigations running in parallel and multiple reporters needing to collaborate. The server installation is larger. The administration is more complex. The benefit is that everyone in the organization can search across everything, with the right permissions, and discoveries in one investigation can inform others.

[serious]

Datashare is built for a single reporter or a small team working on a single investigation at a time. The installation is simpler. The administration is minimal. The benefit is that the tool gets out of the way and lets the reporter work. There is no organizational complexity. There is no permissions system to configure. There are just the documents, the entities, and the analysis.

For most working journalists, Datashare is the right choice. The investigation fits on one laptop. The reporter does not need to share access with anyone else. The setup time is small enough that it does not become a project in itself. The tool serves the journalism rather than demanding attention from the journalism.

The Security Story

One aspect of Datashare deserves particular mention: its design for sensitive work. The tool runs locally on the reporter's machine. The documents never leave it. The processing happens in place. The search index is local. The annotations are local. No third party sees any of it.

This matters for stories where the source needs protection or the documents contain information that adversaries would want to suppress. Sending sensitive documents to a cloud service is, in some cases, an unacceptable risk. The reporter cannot guarantee what happens to documents in a cloud. Even reputable services can be compelled to hand over data by court orders, hacked by skilled adversaries, or subjected to internal leaks. A local tool takes these third-party risks off the table, although the reporter's own machine still has to be kept secure.

The cost of local processing is performance. The reporter's laptop is not as fast as a dedicated server. Processing a thousand-page document collection might take an hour rather than a minute. For the kind of investigation where every hour matters, this slowness is a real cost. For the kind of investigation where confidentiality matters more than speed, it is a price worth paying.

What This Has To Do With Working Journalists

The practical advice for a working reporter is to have Datashare ready before the documents arrive. The setup takes enough effort that you do not want to do it under deadline pressure. Install it on a working machine. Make sure it processes documents the way you expect. Get familiar with the interface. Practice on a small test collection. Then, when the actual investigation arrives, the tool is ready.

The kind of investigation Datashare is built for might happen rarely in a small newsroom. Most stories do not involve large document collections. But when a story does involve them, the difference between having the tool ready and not having it ready is the difference between doing the investigation well and not doing it at all. The tool is small infrastructure that pays off occasionally and dramatically.

For a reporter who works on a specific beat, like mineral exploration in a particular county, Datashare is also useful for the slow accumulation of documents over years. Every annual report, every regulatory filing, every news article gets fed into the system. The collection grows. The search becomes more useful. After several years, the reporter has a personal database of everything ever published about the beat, queryable in seconds, annotated with the reporter's accumulated understanding.

The Larger Pattern

The pattern that Datashare represents is the consortium's commitment to giving away its institutional knowledge. The Panama Papers team learned how to do large-scale collaborative investigation. They could have kept this knowledge proprietary, using it to maintain a competitive advantage over smaller publications. Instead, they decided to package it as software and give it to anyone who wanted it. The decision was deliberate. The justification was that journalism is better when more reporters can do good work, even reporters at small publications who would never join the consortium.

[calm]

This is a model worth honoring. The most experienced practitioners of a craft transfer their methods into software so that less experienced practitioners can benefit. The software is open, free, and maintained. The community of users grows. The journalism improves at every scale. Nobody captures all the value, but the cumulative value to society is larger than any private alternative would have produced.

For the Swedish reporter working on a small but real investigation, Datashare is one of the inheritances of this model. The reporter does not need to be part of the consortium to benefit from what the consortium learned. The lessons of the Panama Papers, distilled into software, are available to any reporter who needs them. The tools exist. The work continues. The journalism, at every scale, gets better because someone decided that better tools should be available to everyone. That is the model. That is the gift. That is what the open infrastructure of journalism makes possible.