OSS Scan: Entity Database / Cross-Investigation Identity Tools

Date: 2026-05-26
Capability target (C4): Durable entity identity across multiple investigations, so that a non-mining story can discover a mining link via shared canonical entities.
Constraints: Internal tool, never public-deployed. MIT/BSD/Apache preferred for code lift; AGPL acceptable as run-alongside service.

Tier 1 — FollowTheMoney ecosystem (the obvious spine)

1. FollowTheMoney (FtM)

URL: https://github.com/alephdata/followthemoney
What: Data model + Python lib for investigative entities (Person, Company, Asset, Address, Payment, Ownership, …) used by ICIJ, OCCRP, OpenSanctions.
License: MIT.
Status: Active (powers OpenSanctions weekly, OpenAleph rebuild).
Lift: The schema YAMLs (followthemoney/schema/*.yaml) — copy directly. Country/identifier/name normalization helpers. Entity proxy class.
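Cross-investigation: Entity IDs are assigned by whoever produces the data, typically as a hash of chosen key fields (the entity proxy's make_id() helper does this). The same Company in two FtM stores collapses on merge only if both stores derived the ID from the same keys, for example country plus registration number; otherwise the match has to come from a resolver such as Nomenklatura.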
Concerns: Schema is rich but opinionated (corruption/sanctions bias). "Mining permit" or "real estate parcel" isn't first-class — you'll subclass or stretch Asset/License. Already on your list; included as anchor.
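
A content-derived ID of the kind described above can be sketched with the standard library alone. This is an illustration under assumptions, not FtM's implementation: normalize() and make_entity_id() are hypothetical helpers, and hashing name, country, and registration number with SHA-1 is an assumed key choice.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def make_entity_id(name: str, country: str, reg_number: str) -> str:
    """Derive a stable ID from normalized key fields (assumed key: name + country + ID)."""
    key = "|".join(normalize(part) for part in (name, country, reg_number))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Two spellings of the same made-up company yield the same ID.
id_a = make_entity_id("Norrland  Gruv AB", "SE", "556000-0001")
id_b = make_entity_id("norrland gruv ab", "se", "556000-0001")
id_other = make_entity_id("Annat Bolag AB", "SE", "556000-0001")
```

The design point is that the ID is only as stable as the key: if one investigation hashes the name and another hashes the registration number, the two records will not collapse without a resolver.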

2. Nomenklatura

URL: https://github.com/opensanctions/nomenklatura
What: Entity reconciliation engine — dedupes and integrates FtM streams across sources. The "make X = X across corpora" component.
License: MIT.
Status: Active (heartbeat is OpenSanctions' weekly merge).
Lift: The whole linker (Resolver, Index, Judgement). This is the closest off-the-shelf answer to your C4 question.
Cross-investigation: Persistent Judgement store records human decisions ("these two refer to the same entity") — judgements survive across re-runs and new corpora. Exactly the cross-investigation memory you want.
Concerns: Tied to FtM types. Smaller community than dedupe or Splink; the Sequoyah/OpenSanctions team is effectively the only maintainer.
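
The idea of a persistent judgement store can be sketched as an append-only log that is replayed on start-up. This is a minimal stand-in, not Nomenklatura's API: the JudgementStore class, its JSON-lines file format, and the entity IDs are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class JudgementStore:
    """Append-only log of human match decisions (hypothetical sketch)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.parent = {}       # entity ID -> parent ID in its cluster
        self.negative = set()  # pairs a human marked as different
        if path.exists():
            for line in path.read_text(encoding="utf-8").splitlines():
                self._apply(**json.loads(line))

    def _find(self, entity_id: str) -> str:
        # Follow parent pointers until we reach the cluster root.
        while self.parent.get(entity_id, entity_id) != entity_id:
            entity_id = self.parent[entity_id]
        return entity_id

    def _apply(self, left: str, right: str, same: bool) -> None:
        if same:
            root_a, root_b = sorted((self._find(left), self._find(right)))
            self.parent[root_b] = root_a  # smallest ID becomes canonical
        else:
            self.negative.add(frozenset((left, right)))

    def decide(self, left: str, right: str, same: bool) -> None:
        """Record a decision and persist it so later runs can replay it."""
        self._apply(left, right, same)
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps({"left": left, "right": right, "same": same}) + "\n")

    def canonical(self, entity_id: str) -> str:
        return self._find(entity_id)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "judgements.jsonl"
    first_run = JudgementStore(log)
    first_run.decide("inv2-co-9", "inv1-co-1", same=True)
    first_run.decide("inv1-co-1", "inv3-co-4", same=False)
    second_run = JudgementStore(log)  # a later investigation reloads the log
    merged_id = second_run.canonical("inv2-co-9")
    untouched_id = second_run.canonical("inv3-co-4")
    remembered_negative = frozenset(("inv1-co-1", "inv3-co-4")) in second_run.negative
```

Recording negative decisions matters as much as positive ones: without them, every new corpus would re-propose pairs a human has already rejected.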

3. OpenAleph

URL: https://github.com/openaleph/openaleph
What: Fork-continuation of Aleph by Data and Research Center (DARC); document store + entity index + cross-dataset search.
License: MIT.
Status: Active (post-OCCRP-handoff energy, 2025–2026).
Lift: The whole "documents → entities → cross-collection links" architecture; or just openaleph-search (FtM-in-Elasticsearch index module) as a library.
Cross-investigation: "Collections" model — each investigation is a collection, but entities cross-link via FtM IDs and xref. Built for this exact use case.
Concerns: Heavy stack (ES + Postgres + Redis + workers). Already on your list. Run-alongside, not code-lift.

4. Zavod

URL: https://github.com/opensanctions/opensanctions (zavod subdir / zavod PyPI)
What: Crawler framework for emitting FtM entities from sources (registries, sanctions lists, news).
License: MIT.
Status: Active.
Lift: Crawler base class + context.emit() pattern if you want to convert Bolagsverket / Lantmäteriet / SCB feeds into FtM streams.
Cross-investigation: Indirect — produces the entity streams that Nomenklatura merges.
Concerns: Optimized for the OpenSanctions data factory; smaller projects often outgrow or under-use it.
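
The crawl-and-emit pattern can be sketched without the framework. The code below is a generic generator, not Zavod's API: crawl(), make_id(), and the row layout are hypothetical, and the output only imitates the FtM JSON shape (id, schema, properties with list values).

```python
import hashlib
from typing import Iterable, Iterator


def make_id(*parts: str) -> str:
    """Hash the key parts into a stable ID (stand-in for a crawler's ID helper)."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def crawl(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield FtM-shaped dicts for each registry row (row layout is assumed)."""
    for row in rows:
        company_id = make_id("se-company", row["orgnr"])
        yield {
            "id": company_id,
            "schema": "Company",
            "properties": {
                "name": [row["name"]],
                "jurisdiction": ["se"],
                "registrationNumber": [row["orgnr"]],
            },
        }
        for director in row.get("directors", []):
            person_id = make_id("se-person", row["orgnr"], director)
            yield {"id": person_id, "schema": "Person", "properties": {"name": [director]}}
            # The link between the two is its own entity.
            yield {
                "id": make_id("se-directorship", company_id, person_id),
                "schema": "Directorship",
                "properties": {"organization": [company_id], "director": [person_id]},
            }


# Made-up registry row for illustration.
rows = [{"orgnr": "5560000001", "name": "Exempel Gruv AB", "directors": ["Anna Exempel"]}]
entities = list(crawl(rows))
```

Because the IDs are derived from the registry's own keys, re-running the crawler over the same feed yields the same entities rather than duplicates.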

Tier 2 — Entity resolution libraries (the matching brain)

5. Splink

URL: https://github.com/moj-analytical-services/splink
What: Probabilistic record linkage at scale (Fellegi-Sunter on DuckDB/Spark/Postgres).
License: MIT.
Status: Very active (UK MoJ, Australian Bureau of Statistics 2026 Census).
Lift: The blocking + scoring pipeline; runs on embedded DuckDB, so there is no infrastructure to operate. Strong on name and address matching.
Cross-investigation: Unsupervised model trained once on your canonical entity table; re-score when a new investigation arrives.
Concerns: SQL-backend means you express comparisons as SQL — fine, but a step away from "just call a Python function." No built-in persistent judgement store; you'd bolt one on.
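
The Fellegi-Sunter scoring behind this kind of linkage can be worked through by hand. The sketch below is plain Python rather than Splink code, and the m and u probabilities and the prior are invented for illustration; in practice they would be estimated from data.

```python
import math


def match_weight(m: float, u: float) -> float:
    """log2 Bayes factor for one observation.

    m = P(observation | records match), u = P(observation | records do not match).
    """
    return math.log2(m / u)


def match_probability(prior: float, weights: list) -> float:
    """Combine a prior match probability with per-field match weights."""
    total = math.log2(prior / (1 - prior)) + sum(weights)
    odds = 2 ** total
    return odds / (1 + odds)


# Illustrative values only, not estimated from any dataset.
name_agrees = match_weight(m=0.9, u=0.01)
city_agrees = match_weight(m=0.8, u=0.1)
orgnr_differs = match_weight(m=0.05, u=0.99)

p_agree = match_probability(prior=0.001, weights=[name_agrees, city_agrees])
p_conflict = match_probability(
    prior=0.001, weights=[name_agrees, city_agrees, orgnr_differs]
)
```

Agreement on a rare value pushes the weight up, disagreement on a reliable identifier pulls it down sharply, which is why a conflicting registration number can outweigh a matching name and city.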

6. dedupe

URL: https://github.com/dedupeio/dedupe
What: ML-based fuzzy matching / record linkage with active-learning UI ("is this pair the same?").
License: MIT.
Status: Maintained but slower cadence; the dedupe.io hosted product is the commercial side.
Lift: Active-learning loop is the gem — Pär can label 50 pairs and get a usable model. Pair this with FtM schema.
Cross-investigation: Persist the trained settings + a canonical-entity table; re-link new investigations against it.
Concerns: Older codebase. Memory-hungry on large corpora. Splink has overtaken it on scale.
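
Why a few dozen labels can go a long way is easiest to see with uncertainty sampling, a generic active-learning strategy. The sketch below is not dedupe's API, and dedupe's own selection strategy may differ; pick_for_labelling() and the scores are hypothetical.

```python
def pick_for_labelling(scored_pairs: list, k: int) -> list:
    """Return the k pairs whose match score is closest to 0.5.

    Those are the pairs the current model is least sure about, so a human
    label on them changes the model the most.
    """
    return sorted(scored_pairs, key=lambda pair: abs(pair[2] - 0.5))[:k]


# (left ID, right ID, current model score); all values are made up.
scored = [
    ("a1", "b7", 0.97),
    ("a1", "c3", 0.52),
    ("b7", "d2", 0.08),
    ("c3", "d2", 0.41),
]
to_label = pick_for_labelling(scored, k=2)
```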

7. recordlinkage (J. de Bruin)

URL: https://github.com/J535D165/recordlinkage
What: Pandas-native record linkage toolkit — blocking, comparison, classification.
License: BSD-3-Clause.
Status: Maintained, slow cadence.
Lift: Useful primitives (Jaro-Winkler/Levenshtein/Soundex wrappers, blocking indexer) for small/medium datasets where Splink is overkill.
Cross-investigation: None built-in — purely a matching library. You'd own the canonical store.
Concerns: Doesn't scale past ~1M pairs comfortably. Fine for Swedish-investigation sizes.
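
The blocking, comparison, and classification steps can be sketched with the standard library. This is not the recordlinkage API: difflib's SequenceMatcher stands in for Jaro-Winkler, and the postcode blocking key, the 0.7 threshold, and the records are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(records: list):
    """Blocking: only pair records that share a postcode (assumed blocking key)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["postcode"]].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def name_similarity(a: dict, b: dict) -> float:
    """Comparison: similarity ratio in [0, 1] between lowercased names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


# Made-up records from two hypothetical investigations.
records = [
    {"id": "a1", "name": "Norrland Gruv AB", "postcode": "98131"},
    {"id": "b7", "name": "Norrland Gruv Aktiebolag", "postcode": "98131"},
    {"id": "c3", "name": "Skåne Fastigheter AB", "postcode": "21120"},
]

# Classification: a fixed threshold stands in for a trained classifier.
matches = [
    (a["id"], b["id"])
    for a, b in candidate_pairs(records)
    if name_similarity(a, b) >= 0.7
]
```

Blocking is what keeps the pair count manageable: three records in two blocks produce one comparison here instead of three.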

Tier 3 — Graph / knowledge stores for the entity layer

8. Kùzu

URL: https://github.com/kuzudb/kuzu
What: Embedded property graph DB (Cypher, vector + FTS built-in). DuckDB-style "no server."
License: MIT.
Status: Acquired by Apple (2025), so check whether the upstream repo is still taking changes before committing to it; the released code remains MIT.
Lift: Use as the entity-graph backend instead of Postgres+pgvector. Cypher for "show me all entities connected to company X across investigations."
Cross-investigation: Storage layer; identity logic still yours.
Concerns: Apple acquisition — pin a version. Schema migration story for an evolving entity model is immature.
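
The "all entities connected to company X across investigations" question is a bounded graph traversal. The sketch below uses a breadth-first search over an in-memory edge list as a stand-in for the Cypher query; it does not use Kùzu, and the node names, edge layout, and connected() helper are hypothetical.

```python
from collections import defaultdict, deque

# (source, target, investigation) edges; all names are made up.
edges = [
    ("co:norrland-gruv", "p:anna", "mining-2025"),
    ("p:anna", "co:fastighet-x", "realestate-2026"),
    ("co:fastighet-x", "p:bertil", "realestate-2026"),
    ("co:unrelated", "p:cecilia", "realestate-2026"),
]


def connected(start: str, max_hops: int = 3):
    """Return (node -> hop count, investigations touched) reachable from start."""
    adjacency = defaultdict(list)
    for source, target, investigation in edges:
        adjacency[source].append((target, investigation))
        adjacency[target].append((source, investigation))

    hops = {start: 0}
    investigations = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour, investigation in adjacency[node]:
            investigations.add(investigation)
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops, investigations


reachable, touched = connected("co:norrland-gruv")
```

The traversal only crosses investigations because "p:anna" carries the same ID in both; that shared identity is the part the storage layer does not provide.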

9. TerminusDB

URL: https://github.com/terminusdb/terminusdb
What: Document graph DB with git-for-data (branches, diffs, merges of structured data).
License: Apache-2.0.
Status: Picked up by DFRNT Studio (2025) after a quiet stretch — alive but smaller community.
Lift: Branch-per-investigation, merge into the canonical entity graph. Pär's "investigations as branches" maps cleanly.
Cross-investigation: Native concept — branch divergence, then merge with conflict resolution on entity identity.
Concerns: Learning curve (WOQL or GraphQL). Bus-factor: small team. Worth a 1-day spike, not a default.
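
Merging investigation branches with conflict resolution on entity identity amounts to a three-way merge per entity. The sketch below shows that logic on plain dicts; it is not TerminusDB's API, and three_way_merge() and the property values are hypothetical.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two edited copies of an entity against their common ancestor.

    Returns (merged properties, list of keys that need a human decision).
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o            # both sides agree
        elif o == b:
            value = t            # only their branch changed it
        elif t == b:
            value = o            # only our branch changed it
        else:
            conflicts.append(key)
            value = o            # keep ours until someone decides
        if value is not None:
            merged[key] = value
    return merged, conflicts


# Made-up entity as seen from the canonical graph and two investigation branches.
base = {"name": "Norrland Gruv AB", "country": "se"}
mining_branch = {"name": "Norrland Gruv AB", "country": "se", "permit": "K-123", "status": "active"}
realestate_branch = {"name": "Norrland Gruv Aktiebolag", "country": "se", "status": "dissolved"}

merged, conflicts = three_way_merge(base, mining_branch, realestate_branch)
```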

10. FalkorDB

URL: https://github.com/FalkorDB/FalkorDB
What: Redis-module property graph (a fork and continuation of RedisGraph) using GraphBLAS sparse matrices; marketed for GraphRAG.
License: Server Side Public License (SSPL) — not OSI-approved, AGPL-adjacent restrictions. Treat as run-alongside only; do not lift code.
Status: Active.
Lift: Don't lift code. Use as service if benchmarks justify.
Cross-investigation: Storage only.
Concerns: SSPL license is the blocker. Kùzu is the more permissive twin.

Tier 4 — Investigative platforms (whole-app, run-alongside)

11. Vertex Synapse

URL: https://github.com/vertexproject/synapse
What: "Central intelligence system" — hypergraph data store + Storm query language + analyst workflow. Mature (started 2008-ish at Vertex).
License: Apache-2.0.
Status: Very active (regular releases).
Lift: The structured data model and the Storm DSL are genuinely interesting for journalist queries ("show me every Company whose director also sat on a Permit board"). Or just steal the model design.
Cross-investigation: Native — every node has stable identity; "views" overlay private analyst work on a shared canonical layer.
Concerns: Cyber/threat-intel-shaped culture. Significant learning curve. Likely over-tooled for a one-journalist setup, but the view/layer model is worth reading.
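
The view-over-canonical-layer idea can be illustrated with the standard library's ChainMap, where writes land in the topmost mapping and reads fall through to the layers below. This is only an analogy for the concept, not Synapse's implementation; the entity IDs and fields are made up.

```python
from collections import ChainMap

# Shared canonical layer (made-up entity).
canonical = {"co:norrland-gruv": {"name": "Norrland Gruv AB", "country": "se"}}

# The analyst's private view: an empty overlay stacked on the canonical layer.
view = ChainMap({}, canonical)

# Reads fall through to the canonical layer; writes stay in the overlay.
view["co:norrland-gruv"] = {**view["co:norrland-gruv"], "note": "possible permit link"}
view["p:draft-source"] = {"name": "Unverified source"}

private_changes = dict(view.maps[0])
```

Promoting work to the shared layer then becomes an explicit step (copying selected keys from the overlay into the canonical mapping) rather than a side effect of taking notes.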

12. OpenCTI

URL: https://github.com/OpenCTI-Platform/opencti
What: Threat-intel knowledge graph (STIX 2.1 model) with investigations, pivots, case management.
License: Community = Apache-2.0; Enterprise = proprietary EE license. Stick to CE.
Status: Very active (Filigran-backed).
Lift: The investigation/pivot UX, case management, and entity-merge UI are battle-tested. Borrow UI patterns even if you don't run it.
Cross-investigation: Strong — investigations are first-class, pivot across them on any entity.
Concerns: STIX schema is cyber-domain, doesn't map cleanly to mining/real-estate/political-graft. Heavy stack.

13. Liquid Investigations

URL: https://github.com/liquidinvestigations
What: Self-hosted bundle for cross-border journalist collaboration — Hoover (doc search), Nextcloud, wiki, chat.
License: MIT.
Status: Maintained, modest cadence; EIC.network's stack.
Lift: Hoover (search/OCR over heterogeneous corpora) is the interesting piece — entity model is thin though.
Cross-investigation: Weak on structured entities; strong on cross-corpus document search.
Concerns: Notes-and-docs flavor; not the entity spine you need. Useful as a companion to FtM, not a replacement.

Tier 5 — Second-brain with entity-shape (filtered)

14. Anytype

URL: https://github.com/anyproto (protocol MIT; apps source-available)
What: Local-first object database with typed objects, relations, and a schema-ish "Type" system.
License: Protocol MIT; apps source-available (not OSS). Self-hosting allowed.
Status: Very active.
Lift: UX inspiration only. The typed-object + relation model is the closest "second-brain with real entities", but the apps' source-available license restricts reuse, so treat the app code as off-limits for lifting.
Cross-investigation: Spaces are siloed by default — cross-space entity identity is not a feature.
Concerns: License is the deal-breaker for code lift. Worth playing with for UX ideas; do not depend on.

15. Nordic Registry MCP Server (Sweden-specific)

URL: https://github.com/olgasafonova/nordic-registry-mcp-server
What: MCP server wrapping Bolagsverket (SE), Brønnøysund (NO), CVR (DK), PRH (FI) registries — board members, signing authority, bankruptcy.
License: Check repo (likely MIT; verify).
Status: New (2025/2026).
Lift: The Bolagsverket OAuth2 + endpoint plumbing — the most annoying part of Swedish company-graph work.
Cross-investigation: Not directly; it's an ingest source. Pipe its output through FtM → Nomenklatura.
Concerns: Single-maintainer project, young. Treat as reference implementation; expect to rewrite the parts you depend on.
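
Before registry output goes through FtM and Nomenklatura, the organisation number is the natural identity key, so it is worth normalizing and validating on ingest. Swedish organisation numbers are ten digits with a Luhn check digit; the sketch below checks that. normalize_orgnr() is a hypothetical helper, and the sample number was constructed to pass the checksum rather than taken from a real company.

```python
import re


def normalize_orgnr(raw: str) -> str:
    """Strip separators and return the 10-digit number, or raise ValueError."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {raw!r}")

    total = 0
    for index, char in enumerate(digits):
        value = int(char)
        if index % 2 == 0:  # Luhn: double every other digit, starting from the first
            value *= 2
            if value > 9:
                value -= 9
        total += value
    if total % 10 != 0:
        raise ValueError(f"checksum failed for {raw!r}")
    return digits


clean = normalize_orgnr("556000-0001")

try:
    normalize_orgnr("556000-0002")
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Catching a mistyped digit at ingest is cheaper than discovering later that one company has split into two entities.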

Honest synthesis

The shortest path to C4 is FtM schema + Nomenklatura resolver + Kùzu (or Postgres) as graph store, with Splink for the heavy matching when you have >100k entities. Everything else on this list is either (a) UX inspiration, (b) a competing whole-app you'd run alongside not lift from, or (c) a source-of-data wrapper. The FtM ecosystem is genuinely the answer; the rest is decoration or alternatives if FtM's corruption-bias schema chafes.

One real risk: the FtM schema assumes anti-corruption framing. If Pär's mining investigations spawn lots of physical-world entities (claims, drill sites, parcels), you'll subclass Asset and may eventually fork. Plan for that fork at design time, not in year two.

Sources:

FollowTheMoney · Nomenklatura · OpenAleph · OpenSanctions/zavod
Splink · dedupe · recordlinkage
Kùzu · TerminusDB · FalkorDB
Vertex Synapse · OpenCTI · Liquid Investigations
Anytype · Nordic Registry MCP