Nomenklatura's "Judgement" store lets investigators flag when two entities are the same across separate cases—and those decisions survive every re-run, building a canonical identity map that connects mining permits to shell company networks.
OSS Scan: Entity Database / Cross-Investigation Identity Tools
Date: 2026-05-26
Capability target (C4): Durable entity identity across multiple investigations — so a non-mining story can discover a mining link via shared canonical entities.
Constraints: Internal tool, never public-deployed. MIT/BSD/Apache preferred for code lift; AGPL acceptable as run-alongside service.
Tier 1 — FollowTheMoney ecosystem (the obvious spine)
1. FollowTheMoney (FtM)
- URL: https://github.com/alephdata/followthemoney
- What: Data model + Python lib for investigative entities (Person, Company, Asset, Address, Payment, Ownership, …) used by ICIJ, OCCRP, OpenSanctions.
- License: MIT.
- Status: Active (powers OpenSanctions weekly, OpenAleph rebuild).
- Lift: The schema YAMLs (followthemoney/schema/*.yaml), which can be copied directly. Country/identifier/name normalization helpers. Entity proxy class.
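- Cross-investigation: Entity IDs are usually content-derived: the producer hashes chosen key fields (e.g. name + country + registration ID), with the fingerprints lib normalizing names. The same Company in two FtM stores collapses on merge only when both stores derive the ID from the same keys.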
- Concerns: Schema is rich but opinionated (corruption/sanctions bias). "Mining permit" or "real estate parcel" isn't first-class, so you'll subclass or stretch Asset/License. Already on your list; included as anchor.
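To make the content-derived ID idea concrete, here is a minimal stdlib-only sketch. It is not FtM's implementation: normalize() and make_entity_id() are hypothetical helpers, and the company name and registration number are made-up sample values. It only shows why two investigations that hash the same normalized key fields arrive at the same ID.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents and collapse whitespace (illustrative only)."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def make_entity_id(*parts: str) -> str:
    """Hash normalized key parts into a stable hex ID (hypothetical helper)."""
    digest = hashlib.sha1()
    for part in parts:
        digest.update(normalize(part).encode("utf-8"))
        digest.update(b"\x1f")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


# Two investigations key the same (made-up) company with different formatting.
id_mining = make_entity_id("SE", "556000-0001", "Norrland Gruv AB")
id_property = make_entity_id("se", "556000-0001", "  Norrland  Gruv  AB ")
```

Both calls yield the same ID because the inputs normalize to identical key parts. The flip side is that any change to the key fields or to the normalization rules changes every ID, so the key choice should be fixed early.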
2. Nomenklatura
- URL: https://github.com/opensanctions/nomenklatura
- What: Entity reconciliation engine — dedupes and integrates FtM streams across sources. The "make X = X across corpora" component.
- License: MIT.
- Status: Active (heartbeat is OpenSanctions' weekly merge).
- Lift: The whole linker (Resolver, Index, Judgement). This is the closest off-the-shelf answer to your C4 question.
- Cross-investigation: The persistent Judgement store records human decisions ("these two refer to the same entity"), and judgements survive across re-runs and new corpora. Exactly the cross-investigation memory you want.
- Concerns: Tied to FtM types. Smaller community than dedupe/splink — Sequoyah/OpenSanctions team is effectively the maintainer.
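The judgement-store idea can be sketched in a few lines. This is an illustration of the concept only, not Nomenklatura's API: the JudgementStore class, its table layout, and the sample IDs are all assumptions. Pairwise decisions are persisted in SQLite, and a union-find pass over the positive ones derives a canonical ID per cluster.

```python
import sqlite3
from typing import Dict


class JudgementStore:
    """Persist pairwise same/different decisions and derive canonical IDs."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to keep decisions across runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judgement ("
            "left_id TEXT, right_id TEXT, verdict TEXT, "
            "PRIMARY KEY (left_id, right_id))"
        )

    def decide(self, a: str, b: str, same: bool) -> None:
        left, right = sorted((a, b))  # store each pair in one orientation
        verdict = "positive" if same else "negative"
        self.db.execute(
            "INSERT OR REPLACE INTO judgement VALUES (?, ?, ?)",
            (left, right, verdict),
        )
        self.db.commit()

    def canonical_map(self) -> Dict[str, str]:
        """Union-find over positive judgements; the smallest ID is canonical."""
        parent: Dict[str, str] = {}

        def find(x: str) -> str:
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        rows = self.db.execute(
            "SELECT left_id, right_id FROM judgement WHERE verdict = 'positive'"
        ).fetchall()
        for left, right in rows:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                low, high = sorted((root_l, root_r))
                parent[high] = low
        return {node: find(node) for node in list(parent)}


store = JudgementStore()
store.decide("mining:permit-holder-17", "shell:company-a", same=True)
store.decide("shell:company-a", "property:owner-9", same=True)
store.decide("shell:company-b", "property:owner-9", same=False)
canonical = store.canonical_map()
```

Here the mining, shell and property records end up under one canonical ID, while the negative judgement keeps shell:company-b out of the cluster. A real store would also use negative judgements to stop a matcher from re-proposing rejected pairs.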
3. OpenAleph
- URL: https://github.com/openaleph/openaleph
- What: Fork-continuation of Aleph by Data and Research Center (DARC); document store + entity index + cross-dataset search.
- License: MIT.
- Status: Active (post-OCCRP-handoff energy, 2025–2026).
- Lift: The whole "documents → entities → cross-collection links" architecture; or just openaleph-search (FtM-in-Elasticsearch index module) as a library.
- Cross-investigation: "Collections" model — each investigation is a collection, but entities cross-link via FtM IDs and xref. Built for this exact use case.
- Concerns: Heavy stack (ES + Postgres + Redis + workers). Already on your list. Run-alongside, not code-lift.
4. Zavod
- URL: https://github.com/opensanctions/opensanctions (zavod subdir / zavod PyPI)
- What: Crawler framework for emitting FtM entities from sources (registries, sanctions lists, news).
- License: MIT.
- Status: Active.
- Lift: Crawler base class + context.emit() pattern if you want to convert Bolagsverket / Lantmäteriet / SCB feeds into FtM streams.
- Cross-investigation: Indirect — produces the entity streams that Nomenklatura merges.
- Concerns: Optimized for the OpenSanctions data factory; smaller projects often outgrow or under-use it.
Tier 2 — Entity resolution libraries (the matching brain)
5. Splink
- URL: https://github.com/moj-analytical-services/splink
- What: Probabilistic record linkage at scale (Fellegi-Sunter on DuckDB/Spark/Postgres).
- License: MIT.
- Status: Very active (UK MoJ, Australian Bureau of Statistics 2026 Census).
- Lift: The blocking + scoring pipeline; runs on DuckDB embedded so no infra. Best-in-class accuracy on names+addresses.
- Cross-investigation: Unsupervised model trained once on your canonical entity table; re-score when a new investigation arrives.
- Concerns: SQL-backend means you express comparisons as SQL — fine, but a step away from "just call a Python function." No built-in persistent judgement store; you'd bolt one on.
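For readers unfamiliar with Fellegi-Sunter scoring, the arithmetic behind a match weight can be sketched as follows. This does not use Splink's API. The field names, the m/u probabilities, and the prior are illustrative assumptions, where a real model would estimate them from the data.

```python
import math

# Assumed parameters per comparison field (illustrative values only).
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "name": {"m": 0.95, "u": 0.01},
    "postcode": {"m": 0.90, "u": 0.05},
    "org_number": {"m": 0.99, "u": 0.0001},
}


def match_weight(agreements: dict, prior: float = 0.001) -> float:
    """Sum log2 Bayes factors over fields, starting from the prior odds."""
    weight = math.log2(prior / (1 - prior))
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]["m"], FIELD_PARAMS[field]["u"]
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(bayes_factor)
    return weight


def match_probability(weight: float) -> float:
    """Convert a log2-odds match weight back to a probability."""
    odds = 2.0 ** weight
    return odds / (1 + odds)


# All fields agree vs. only the name agrees.
strong = match_weight({"name": True, "postcode": True, "org_number": True})
weak = match_weight({"name": True, "postcode": False, "org_number": False})
```

With these assumed values, agreement on a rare identifier such as the organisation number dominates the score, while a shared name alone stays well below any sensible match threshold.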
6. dedupe
- URL: https://github.com/dedupeio/dedupe
- What: ML-based fuzzy matching / record linkage with active-learning UI ("is this pair the same?").
- License: MIT.
- Status: Maintained but slower cadence; the dedupe.io hosted product is the commercial side.
- Lift: Active-learning loop is the gem — Pär can label 50 pairs and get a usable model. Pair this with FtM schema.
- Cross-investigation: Persist the trained settings + a canonical-entity table; re-link new investigations against it.
- Concerns: Older codebase. Memory-hungry on large corpora. Splink has overtaken it on scale.
7. recordlinkage (J. de Bruin)
- URL: https://github.com/J535D165/recordlinkage
- What: Pandas-native record linkage toolkit — blocking, comparison, classification.
- License: BSD-3-Clause.
- Status: Maintained, slow cadence.
- Lift: Useful primitives (Jaro-Winkler/Levenshtein/Soundex wrappers, blocking indexer) for small/medium datasets where Splink is overkill.
- Cross-investigation: None built-in — purely a matching library. You'd own the canonical store.
- Concerns: Doesn't scale past ~1M pairs comfortably. Fine for Swedish-investigation sizes.
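The blocking-then-compare pattern these primitives support can be sketched with the standard library alone. This is not the recordlinkage API. The block key (first three postcode digits), the 0.7 threshold, and the sample records are assumptions, and difflib's ratio stands in for Jaro-Winkler or Levenshtein.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record: dict) -> str:
    """Blocking key: first three digits of the postcode (hypothetical choice)."""
    return record["postcode"].replace(" ", "")[:3]


def candidate_pairs(left: list, right: list):
    """Yield only pairs that share a blocking key, not the full cross product."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[block_key(rec)].append(rec)
    for rec in left:
        for other in buckets.get(block_key(rec), []):
            yield rec, other


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib, a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Made-up canonical table and one record from a new investigation.
canonical = [
    {"id": "c1", "name": "Norrland Gruv AB", "postcode": "981 31"},
    {"id": "c2", "name": "Skåne Fastigheter AB", "postcode": "211 20"},
]
incoming = [{"id": "n1", "name": "Norrland Gruv Aktiebolag", "postcode": "98131"}]

matches = [
    (new["id"], old["id"])
    for new, old in candidate_pairs(incoming, canonical)
    if name_similarity(new["name"], old["name"]) >= 0.7
]
```

Blocking keeps the number of compared pairs manageable, at the cost of missing true matches whose blocking key differs (for example, a company that moved to another postcode).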
Tier 3 — Graph / knowledge stores for the entity layer
8. Kùzu
- URL: https://github.com/kuzudb/kuzu
- What: Embedded property graph DB (Cypher, vector + FTS built-in). DuckDB-style "no server."
- License: MIT.
- Status: Active; acquired by Apple (2025) — flag risk but code remains MIT.
- Lift: Use as the entity-graph backend instead of Postgres+pgvector. Cypher for "show me all entities connected to company X across investigations."
- Cross-investigation: Storage layer; identity logic still yours.
- Concerns: Apple acquisition — pin a version. Schema migration story for an evolving entity model is immature.
9. TerminusDB
- URL: https://github.com/terminusdb/terminusdb
- What: Document graph DB with git-for-data (branches, diffs, merges of structured data).
- License: Apache-2.0.
- Status: Picked up by DFRNT Studio (2025) after a quiet stretch — alive but smaller community.
- Lift: Branch-per-investigation, merge into the canonical entity graph. Pär's "investigations as branches" maps cleanly.
- Cross-investigation: Native concept — branch divergence, then merge with conflict resolution on entity identity.
- Concerns: Learning curve (WOQL or GraphQL). Bus-factor: small team. Worth a 1-day spike, not a default.
10. FalkorDB
- URL: https://github.com/FalkorDB/FalkorDB
- What: Redis-module property graph (a fork that continues RedisGraph) using GraphBLAS sparse matrices; marketed for GraphRAG.
- License: Server Side Public License (SSPL) — not OSI-approved, AGPL-adjacent restrictions. Treat as run-alongside only; do not lift code.
- Status: Active.
- Lift: Don't lift code. Use as service if benchmarks justify.
- Cross-investigation: Storage only.
- Concerns: SSPL license is the blocker. Kùzu is the more permissive twin.
Tier 4 — Investigative platforms (whole-app, run-alongside)
11. Vertex Synapse
- URL: https://github.com/vertexproject/synapse
- What: "Central intelligence system" — hypergraph data store + Storm query language + analyst workflow. Mature (started 2008-ish at Vertex).
- License: Apache-2.0.
- Status: Very active (regular releases).
- Lift: The structured data model + Storm DSL is genuinely interesting for journalist queries ("show me every Company whose director also sat on a Permit board"). Or just steal the model design.
- Cross-investigation: Native — every node has stable identity; "views" overlay private analyst work on a shared canonical layer.
- Concerns: Cyber/threat-intel-shaped culture. Significant learning curve. Likely over-tooled for a one-journalist setup, but the view/layer model is worth reading.
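The view/layer idea can be illustrated with a loose analogy in plain Python. This is not how Synapse implements views: open_view() and promote() are hypothetical helpers and the node data is made up. Reads fall through to a shared canonical layer, while writes stay in a private top layer until promoted.

```python
from collections import ChainMap

# Shared canonical layer: node ID -> properties (illustrative data).
canonical_layer = {
    "company:556000-0001": {"name": "Norrland Gruv AB", "country": "se"},
}


def open_view(shared: dict) -> ChainMap:
    """Return a view whose writes land in a private top layer only."""
    return ChainMap({}, shared)


def promote(view: ChainMap, key: str) -> None:
    """Move one node from the private layer into the shared layer."""
    view.maps[-1][key] = view.maps[0].pop(key)


view = open_view(canonical_layer)
view["person:draft-1"] = {"name": "Unverified director"}  # private write

visible_before = "person:draft-1" in canonical_layer  # still private
promote(view, "person:draft-1")
visible_after = "person:draft-1" in canonical_layer  # now shared
```

One caveat of this toy version: mutating a nested properties dict read from the shared layer would change it for everyone, so a real overlay needs copy-on-write at the property level.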
12. OpenCTI
- URL: https://github.com/OpenCTI-Platform/opencti
- What: Threat-intel knowledge graph (STIX 2.1 model) with investigations, pivots, case management.
- License: Community = Apache-2.0; Enterprise = proprietary EE license. Stick to CE.
- Status: Very active (Filigran-backed).
- Lift: The investigation/pivot UX, case management, and entity-merge UI are battle-tested. Borrow UI patterns even if you don't run it.
- Cross-investigation: Strong — investigations are first-class, pivot across them on any entity.
- Concerns: STIX schema is cyber-domain, doesn't map cleanly to mining/real-estate/political-graft. Heavy stack.
13. Liquid Investigations
- URL: https://github.com/liquidinvestigations
- What: Self-hosted bundle for cross-border journalist collaboration — Hoover (doc search), Nextcloud, wiki, chat.
- License: MIT.
- Status: Maintained, modest cadence; EIC.network's stack.
- Lift: Hoover (search/OCR over heterogeneous corpora) is the interesting piece — entity model is thin though.
- Cross-investigation: Weak on structured entities; strong on cross-corpus document search.
- Concerns: Notes-and-docs flavor; not the entity spine you need. Useful as a companion to FtM, not a replacement.
Tier 5 — Second-brain with entity-shape (filtered)
14. Anytype
- URL: https://github.com/anyproto (protocol MIT; apps source-available)
- What: Local-first object database with typed objects, relations, and a schema-ish "Type" system.
- License: Protocol MIT; apps source-available (not OSS). Self-hosting allowed.
- Status: Very active.
- Lift: UX inspiration only. The typed-object + relation model is the closest "second-brain with real entities" but you can't legally fork the app.
- Cross-investigation: Spaces are siloed by default — cross-space entity identity is not a feature.
- Concerns: License is the deal-breaker for code lift. Worth playing with for UX ideas; do not depend on.
15. Nordic Registry MCP Server (Sweden-specific)
- URL: https://github.com/olgasafonova/nordic-registry-mcp-server
- What: MCP server wrapping Bolagsverket (SE), Brønnøysund (NO), CVR (DK), PRH (FI) registries — board members, signing authority, bankruptcy.
- License: Check repo (likely MIT; verify).
- Status: New (2025/2026).
- Lift: The Bolagsverket OAuth2 + endpoint plumbing — the most annoying part of Swedish company-graph work.
- Cross-investigation: Not directly; it's an ingest source. Pipe its output through FtM → Nomenklatura.
- Concerns: Single-maintainer project, young. Treat as reference implementation; expect to rewrite the parts you depend on.
Honest synthesis
The shortest path to C4 is FtM schema + Nomenklatura resolver + Kùzu (or Postgres) as graph store, with Splink for the heavy matching when you have >100k entities. Everything else on this list is either (a) UX inspiration, (b) a competing whole-app you'd run alongside rather than lift from, or (c) a source-of-data wrapper. The FtM ecosystem is genuinely the answer; the rest is decoration, or an alternative if FtM's corruption-biased schema chafes.
One real risk: the FtM schema assumes anti-corruption framing. If Pär's mining investigations spawn lots of physical-world entities (claims, drill sites, parcels), you'll subclass Asset and may eventually fork. Plan for that fork at design time, not in year two.