OpenSanctions: The Quiet Work of Knowing Who Is Who

Three People Named John Smith

Suppose you are looking at a list of company directors for a Swedish mining company. The list says the chairman is John Smith. You go to a sanctions database. You search for John Smith. You get fourteen hundred results. Some of them are sanctioned drug traffickers. Some of them are politically exposed persons. Some of them are perfectly innocent. Looking only at the name, you cannot tell which John Smith is your John Smith. The name is not enough.

This is the entity resolution problem, and it is one of the oldest unsolved problems in data work. The same person can appear in different databases with different name formats, different birth dates, different addresses, different transliterations of their name from another alphabet, different titles, different nicknames. The same company can appear in different filings with slightly different legal names. The same vessel can sail under three different flags in two years. The same property can be sold and resold under different ownership structures designed specifically to make the chain hard to follow.

The entire business of investigative journalism, at its operational core, is the business of saying with confidence that this John Smith is that John Smith. Or that he is not. The article cannot be written until that question is answered. The libel risk is enormous. The reputational risk is enormous. And no public dataset, no matter how comprehensive, will hand you the answer. You have to do the work.

OpenSanctions is a project that has industrialized this work for one specific slice of the problem, which is people and companies who appear on sanctions lists or who are politically exposed in some way. It was launched in twenty-twenty-one by Friedrich Lindenberg, the same person who built Aleph at OCCRP, and the same person who maintains the Follow The Money schema. He keeps showing up in this space because there are very few people in the world who care this much about entity resolution. Lindenberg is one of them. The work is mostly quiet, mostly invisible, mostly unglamorous, and almost entirely about getting one person's name aligned with another person's name across two different databases. It is also one of the most important pieces of infrastructure in modern investigative reporting.

What A Sanctions List Looks Like

To understand why OpenSanctions exists, you need to understand what a sanctions list looks like as a raw artifact. Sanctions lists are published by governments and international bodies. The European Union publishes one. The United Nations Security Council publishes one. The United States Office of Foreign Assets Control publishes one. The British government publishes one. Canada, Australia, Japan, Switzerland, and dozens of others publish their own.

These lists are not standardized. The European Union list is XML. The Office of Foreign Assets Control list is also XML, but a different XML. The United Nations list is yet another XML. Some governments publish in CSV. Some governments publish in PDF, which is not really a data format at all but a document format pretending to be a data format. Some governments publish only on a web page that you have to scrape.

Each list has its own conventions for how to spell a name. The same Russian oligarch might appear as Igor Sechin on one list and as Igor Ivanovich Sechin on another and as Сечин Игорь Иванович on a Russian-language source. Each list has its own conventions for how to express a date of birth, an address, a passport number. Some lists include identifiers that uniquely tag the person, like a tax identification number. Some lists include only a name and an estimated decade of birth. The variation is total.
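
A small part of that variation, meaning casing, word order, punctuation, and accents, can be flattened mechanically. What follows is a minimal sketch using only the Python standard library. The helper name normalize_name is my own invention for illustration and is not part of OpenSanctions. It makes no attempt at transliteration between alphabets.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return a crude comparison key for a personal name.

    Hypothetical helper for illustration only. It lowercases, strips
    diacritics, drops punctuation, and sorts the name parts, so that
    "Sechin, Igor" and "Igor SECHIN" produce the same key.
    """
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letters and digits; treat everything else as a separator.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped.casefold())
    # Sorting the parts makes the key independent of word order.
    return " ".join(sorted(cleaned.split()))


same = normalize_name("Sechin, Igor") == normalize_name("Igor SECHIN")
with_patronymic = normalize_name("Igor Ivanovich Sechin")
cyrillic = normalize_name("Сечин Игорь Иванович")
```

The first two spellings collapse to one key. The form with the patronymic and the Cyrillic form still produce different keys, which is the point: cleaning up strings removes the easy differences and leaves the hard ones.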

If you are a bank trying to make sure you are not transferring money to anyone on any sanctions list, you have a serious problem. You need to check every transaction against every list. The names are spelled differently. The lists update at different times. The lists are in different formats. The number of false matches is enormous. The number of true matches that get missed because of a slight spelling difference is also enormous, and missing one of those can cost the bank hundreds of millions of dollars in fines.

For a long time, this problem was solved by buying expensive commercial software from companies like Refinitiv or Dow Jones. These companies took the raw lists from all the governments, did the painstaking work of merging them into a single unified database, and sold access to that database for very large sums of money. The sums were so large that only banks and major corporations could afford them. Investigative journalists, who have approximately the same screening problem as banks, simply could not afford the access.

The Free Alternative

Lindenberg's insight was that the merging work is fundamentally a public good. Every bank and every government is doing this work in parallel, slightly differently, paying for the privilege. The actual underlying data, the sanctions lists themselves, are public. What is locked up is the merge work. If the merge work could be done once, openly, and the result published under a license that allowed free use for non-commercial purposes, journalists and small organizations would have access to capabilities that previously were available only to multinational corporations.

This is the OpenSanctions thesis. The project takes raw lists from over two hundred sources around the world, normalizes them into a single schema, deduplicates entities across lists, and publishes the result as both a downloadable dataset and a search API. The data is free to use for non-commercial purposes, including journalism, civil society research, and academic work. Commercial users pay a licensing fee, which is how the project sustains itself.

The result is something that did not previously exist in any accessible form: a unified database of every person or company that appears on any major sanctions list, plus politically exposed persons, plus criminal watchlists, plus relevant historical data going back several years. You can search it by name. You can search it by identifier. You can download the whole thing and run it offline. You can integrate it into your own pipelines.
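
To make the pipeline idea concrete, here is a hedged sketch of preparing a matching request in Python without sending it. The endpoint URL, the shape of the request body, and the ApiKey authorization header are written from my memory of the hosted OpenSanctions API and should be treated as assumptions to check against the current API documentation. The function build_match_request is a hypothetical helper of my own.

```python
import json
import urllib.request
from typing import Optional

# Assumed endpoint; confirm against the current OpenSanctions API documentation.
MATCH_URL = "https://api.opensanctions.org/match/default"


def build_match_request(name: str,
                        birth_date: Optional[str] = None,
                        api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a query-by-example request for one person."""
    # Properties are multi-valued, so every value is wrapped in a list.
    properties = {"name": [name]}
    if birth_date:
        properties["birthDate"] = [birth_date]
    body = {"queries": {"q1": {"schema": "Person", "properties": properties}}}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumed header format for hosted API keys.
        headers["Authorization"] = f"ApiKey {api_key}"
    return urllib.request.Request(
        MATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_match_request("John Smith", birth_date="1970-01-01")
# Sending is left out on purpose; urllib.request.urlopen(request) would make the call.
```

Passing a birth date along with the name is the design choice that matters here. A name alone is the fourteen hundred John Smiths problem again.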

For a working journalist investigating, say, the corporate ownership behind a Swedish mining exploration permit, OpenSanctions answers one specific question very well. Is any person or company in this story currently sanctioned by any major government? If the answer is no, the story is probably safe to publish without specific sanctions context. If the answer is yes, the story has just taken a sharp turn and you have a lot more reporting to do.

How The Merging Actually Happens

The merging work is the part that interests me most, because it is mostly not what you would expect. When you imagine combining hundreds of sanctions lists into one, you might imagine a clever algorithm doing fuzzy string matching and identifying that Sechin and Sechine and Ce-chin are probably the same person. The clever algorithm exists, but it is not the main thing.

The main thing is a piece of software called nomenklatura, also written by Lindenberg, also open source. The name is a joke that refers to the Soviet-era practice of maintaining official lists of approved personnel. In the OpenSanctions context, nomenklatura is the tool that maintains the project's official list of merged entities and their identifiers across all source datasets.

When nomenklatura encounters two records that might be the same person, it does not just compare the names. It compares structured features. The names, normalized. The birth dates, if available. The countries of citizenship. The identifying numbers. The aliases. The known relationships to other entities. It scores the match. If the score is high enough, the records are automatically merged. If the score is in the uncertain middle, the records are flagged for human review. If the score is low, the records are kept separate.
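
The three-way outcome described above, merge, review, or keep separate, can be sketched in a few lines. To be clear, this is not how nomenklatura computes its scores. The Record type, the weights, and both thresholds are invented here purely to show the shape of the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    name: str
    birth_date: Optional[str] = None      # ISO date string, if known
    countries: frozenset = frozenset()    # country codes
    identifiers: frozenset = frozenset()  # e.g. passport or tax numbers


# Illustrative thresholds, not values used by any real project.
MERGE_AT = 0.8
REVIEW_AT = 0.4


def score_pair(a: Record, b: Record) -> float:
    """Combine a few structured features into a score between 0 and 1."""
    # A shared identifier is treated as near-conclusive evidence.
    if a.identifiers & b.identifiers:
        return 1.0
    # Two known but conflicting birth dates count as strong counter-evidence.
    if a.birth_date and b.birth_date and a.birth_date != b.birth_date:
        return 0.0
    score = 0.0
    if a.name.casefold() == b.name.casefold():
        score += 0.5
    if a.birth_date and a.birth_date == b.birth_date:
        score += 0.35
    if a.countries & b.countries:
        score += 0.15
    return score


def decide(a: Record, b: Record) -> str:
    """Map the score onto the three outcomes: merge, review, or separate."""
    score = score_pair(a, b)
    if score >= MERGE_AT:
        return "merge"
    if score >= REVIEW_AT:
        return "review"
    return "separate"


name_only = decide(Record("John Smith"), Record("john smith"))
```

With these made-up weights, a matching name on its own lands in the review band rather than triggering a merge, which mirrors the John Smith problem from the opening.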

The human review part is the unglamorous heart of the project. Someone, somewhere, has to look at two records and decide whether they are the same person. The software helps. The software does the easy ones automatically. The hard ones still come down to a human reading both records and exercising judgment. This is the part that does not scale, and this is the part that determines whether the resulting dataset is trustworthy or not.

OpenSanctions has been doing this work for nearly five years. The dataset has gradually become very good. The accuracy of the deduplication is high enough that the major commercial databases now reference OpenSanctions as a competitive benchmark. The project has not replaced the commercial databases, because banks need contractual liability that open data does not provide, but for everyone outside the bank market, OpenSanctions has changed what is possible.

Why The Schema Matters Again

The other piece of OpenSanctions worth knowing is that it stores its data in the Follow The Money schema, the same schema Aleph uses. This is not a coincidence. Lindenberg designed both. The two projects can talk to each other directly. An Aleph instance can ingest OpenSanctions data and immediately use it to enrich every entity in its investigation. A reporter using Aleph for a Swedish mineral permit investigation can have the sanctions check happen automatically as part of the ingestion, with no extra work.
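
For a sense of what sharing a schema means in practice, here is a rough sketch of entities in the general shape Follow The Money uses: an id, a schema name, and a set of properties where every value is a list. It is written with plain Python dictionaries rather than the project's own library. The helper make_entity and all the example values are made up, and the specific property names should be checked against the schema documentation before you rely on them.

```python
import json


def make_entity(entity_id: str, schema: str, **properties) -> dict:
    """Build a plain dict in a multi-valued, FtM-like shape (illustrative only)."""
    props = {}
    for key, value in properties.items():
        # Accept a single value or a list; always store a list of strings.
        values = value if isinstance(value, (list, tuple)) else [value]
        props[key] = [str(v) for v in values]
    return {"id": entity_id, "schema": schema, "properties": props}


person = make_entity(
    "example-person-1", "Person",
    name=["John Smith", "J. Smith"], birthDate="1970-01-01",
)
company = make_entity(
    "example-company-1", "Company",
    name="Example Mining AB", jurisdiction="se",
)
# A relationship is itself an entity that points at the other two by id.
link = make_entity(
    "example-directorship-1", "Directorship",
    director=person["id"], organization=company["id"], role="Chairman",
)

print(json.dumps(link, indent=2))
```

Because the relationship is just another record with references, any tool that understands the shape can follow the chain from a person to a company without special-case code.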

This is the network effect of having an open schema in a small field. Every tool that adopts Follow The Money can interoperate with every other tool that adopts it. A new tool that comes along and uses the same schema gets all the existing data for free. The schema becomes the bottleneck and the asset. Whoever controls the schema controls how the field thinks about its data.

The fact that Lindenberg is one human being who built the schema, built Aleph, and built OpenSanctions, and who continues to maintain all three, is unusual. It means the field has a center. It also means the field has a single point of failure. If Lindenberg disappeared tomorrow, the schema would continue to exist as an open standard, but its evolution would slow significantly. This is not a unique situation. Many crucial open source projects have one or two load-bearing humans. It is worth being aware of it. It is also worth being grateful for it.

What This Has To Do With Local Reporting

A reasonable reaction to all of this is that sanctions screening is a problem for banks and for ICIJ-scale international investigations. Why does a local Swedish newspaper need to know about it?

The answer is twofold. The first reason is direct utility. The Swedish mining sector is increasingly dominated by Australian, Canadian, and British junior mining companies. Some of those companies have parent companies in jurisdictions known for opaque ownership. Some have shareholders or directors who have appeared on sanctions lists. Some have been involved in investigations elsewhere in the world. A quick check against OpenSanctions costs nothing and occasionally produces something worth knowing.

The second reason is more subtle. The methodology of OpenSanctions, the way it handles entity resolution, is exactly the methodology you would need to apply to your own data if you were maintaining a long-running investigation of any specific beat. A local reporter who covers the mineral exploration permits in one Swedish county is, over time, building up a personal database of companies, people, and relationships. The data needs to be deduplicated. The same company will appear in twenty different filings with slightly different names. The same person will sign documents using different spellings. Your investigation will be corrupted if you do not handle this.

Reading the OpenSanctions documentation and the nomenklatura source code is, in effect, a free apprenticeship in how to do entity resolution well. The choices the project has made about scoring, about manual review, about identifier preservation, about merging and unmerging, all of these are lessons that apply directly to a one-person investigative database. The project's lessons are not just useful for screening against global sanctions. They are useful for building any investigative database that grows over time.
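
Two of those lessons, identifier preservation and the ability to unmerge, fit in a short sketch for a one-person database. The idea is to never overwrite source records and instead keep a log of pairwise decisions, from which merged groups are computed on demand. This is loosely inspired by the approach described above and is not the actual nomenklatura implementation. MergeLog and the filing ids are hypothetical.

```python
from collections import defaultdict


class MergeLog:
    """Record pairwise 'same entity' decisions without rewriting source records."""

    def __init__(self):
        # Each decision is stored as an unordered pair of source ids.
        self._same = set()

    def merge(self, a: str, b: str) -> None:
        self._same.add(frozenset((a, b)))

    def unmerge(self, a: str, b: str) -> None:
        # Undoing a decision only removes the pair; source ids are untouched.
        self._same.discard(frozenset((a, b)))

    def clusters(self, ids):
        """Group ids into connected components of 'same' decisions."""
        neighbours = defaultdict(set)
        for pair in self._same:
            if len(pair) == 2:
                x, y = tuple(pair)
                neighbours[x].add(y)
                neighbours[y].add(x)
        seen, result = set(), []
        for start in ids:
            if start in seen:
                continue
            # Depth-first walk over everything linked to this id.
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(neighbours[node] - component)
            seen |= component
            result.append(sorted(component))
        return result


log = MergeLog()
log.merge("filing-1", "filing-2")
log.merge("filing-2", "filing-3")
all_ids = ["filing-1", "filing-2", "filing-3", "filing-4"]
before = log.clusters(all_ids)
log.unmerge("filing-2", "filing-3")
after = log.clusters(all_ids)
```

Because the original ids survive every decision, a wrong merge made in year one can be undone in year three without losing track of which filing said what.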

The Quiet Permission, Again

There is a pattern in investigative tooling that is worth naming. The tools that matter most are not the ones with the slickest demos. They are the tools that solve unglamorous problems extremely well. Entity resolution is one of those problems. The work is tedious. The results are invisible. Nobody publishes an article that says, we deduplicated the corporate ownership database. They publish an article that says, the chairman of this company is the same person who was investigated in another country in twenty-eighteen. The deduplication is what made the article possible.

OpenSanctions has industrialized one slice of this unglamorous work and given it away to everyone who is not a bank. The downstream effect is that a working journalist in a small newsroom can now do a piece of due diligence that previously required a budget. The piece of due diligence might come up empty. Most of the time it does. But sometimes it does not, and the story bends sharply, and the journalism is better. That is the trade. That is the work.

The thing to take away, beyond the specific utility, is the pattern. The pattern is that the people doing the most important work in this field are doing slow, careful, tedious, schema-based, name-by-name matching work in basements somewhere, with very little glory and very little funding, because they care about the work. The articles you read in the New York Times or the Süddeutsche Zeitung are downstream of this work. The pattern is the same at every scale. The journalism is downstream of the matching. Whoever does the matching well is shaping what kind of journalism is possible.