Swedish mining investigation platform needs entity deduplication engine: FollowTheMoney + Nomenklatura (both MIT) form the core schema and resolver, letting journalists cross-link suspects, companies, and assets across multiple investigations without AGPL entanglement.
OSS Investigation-Substrate Scan
Scanned 2026-05-26 for a Pär-grade journalist investigation substrate (successor to gruvor). Filter: permissive license preferred (MIT/BSD/Apache); AGPL OK as run-alongside; non-commercial / Do-No-Harm flagged as blockers for code lift. Already known and excluded: OpenAleph, Datashare, gitscrape.
Ranking is by lift-value (how directly we'd reuse code/schema in a new repo), not by general fame.
Tier 1 — Lift code directly
1. FollowTheMoney (opensanctions fork of alephdata/followthemoney)
- URL: https://github.com/opensanctions/followthemoney (active fork; alephdata/followthemoney is the historical home)
- One-liner: Python data model + ontology for investigative entities (Person, Company, Asset, Payment, Vessel, CourtCase, Document...).
- License: MIT. Active: pushed 2026-05-22, 67 stars on the fork; the library is used far more widely than the star count suggests.
- Lift: the entire schema: Thing → LegalEntity → Person/Company/Organization, properties with types (name, country, address, identifier, date), RDF/OWL serialization. Use as the entity table shape; map gruvor mining entities (Bolag, Gruva, Förekomst) onto Company/Asset/Location.
- Covers: C4 (cross-investigation requires a shared entity vocab, and FtM is it), C6 (schema lets "unknown" entities land as Thing and get refined).
- Concerns: opinionated toward financial-crime use-cases; Swedish-place / mineral-deposit nuance is not modelled, so extend with custom schemata. Schemata are defined in YAML files; slight learning curve.
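The mapping described in the Lift bullet can be sketched without installing anything. This is a plain-dataclass illustration, not the followthemoney package: the schema names come from the entry above, while `SCHEMA_PARENTS`, `GRUVOR_TO_FTM`, `Entity` and `to_ftm` are hypothetical names, and the parent links for Asset and Location are assumptions to check against the real schema files.

```python
from dataclasses import dataclass, field

# Hypothetical parent links mirroring the chain named in the entry above.
# Not loaded from followthemoney; Asset/Location parents are assumptions.
SCHEMA_PARENTS = {
    "Thing": None,
    "LegalEntity": "Thing",
    "Person": "LegalEntity",
    "Company": "LegalEntity",
    "Organization": "LegalEntity",
    "Asset": "Thing",
    "Location": "Thing",
}

# gruvor record kinds mapped onto the FtM-style schema names from the text.
GRUVOR_TO_FTM = {"Bolag": "Company", "Gruva": "Asset", "Förekomst": "Location"}


@dataclass
class Entity:
    id: str
    schema: str
    # Multi-valued properties, in the spirit of FtM's list-of-strings values.
    properties: dict[str, list[str]] = field(default_factory=dict)

    def add(self, prop: str, value: str) -> None:
        """Append a value, skipping empties and duplicates."""
        if value:
            values = self.properties.setdefault(prop, [])
            if value not in values:
                values.append(value)

    def is_a(self, schema: str) -> bool:
        """Walk up the parent chain to test schema inheritance."""
        current = self.schema
        while current is not None:
            if current == schema:
                return True
            current = SCHEMA_PARENTS[current]
        return False


def to_ftm(kind: str, record_id: str, name: str, country: str = "se") -> Entity:
    """Map a gruvor record onto an entity; unknown kinds land as Thing (C6)."""
    entity = Entity(id=f"gruvor-{record_id}", schema=GRUVOR_TO_FTM.get(kind, "Thing"))
    entity.add("name", name)
    entity.add("country", country)
    return entity
```

Unknown record kinds fall back to Thing, which matches the C6 idea of landing an entity first and refining its schema later.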
2. Nomenklatura
- URL: https://github.com/opensanctions/nomenklatura
- One-liner: Entity deduplication and cross-dataset resolver built on FtM.
- License: MIT. Active: 2026-05-22, 242 stars.
- Lift: the Resolver graph (judgements as edges, canonical IDs), blocking, the matcher/scorer interfaces. This is precisely the engine C4 needs — given a new ingest, find candidate matches against the union of past investigations.
- Covers: C4 (core), C6 (resolver tolerates partial entities).
- Concerns: assumes you've already standardised on FtM (which makes #1 a prerequisite, not a complication).
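To make "judgements as edges, canonical IDs, blocking" concrete, here is a minimal stdlib sketch of the idea. It is not nomenklatura's API: the `Resolver` class, the `block_key` normalisation and the "positive"/"negative" judgement strings are simplifications invented for illustration, using a union-find structure to derive canonical IDs.

```python
from collections import defaultdict

# Hypothetical list of company-form tokens to ignore when blocking.
COMPANY_SUFFIXES = {"ab", "aktiebolag"}


class Resolver:
    """Toy resolver: positive judgements merge clusters, negative ones block pairs."""

    def __init__(self):
        self.parent = {}
        self.negative = set()

    def _find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps lookups cheap as clusters grow.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def decide(self, left, right, judgement):
        """Record a human judgement about a pair of entity IDs."""
        if judgement == "negative":
            self.negative.add(frozenset((left, right)))
        elif judgement == "positive":
            a, b = self._find(left), self._find(right)
            if a != b:
                low, high = sorted((a, b))
                self.parent[high] = low  # smallest ID stays canonical

    def canonical(self, entity_id):
        return self._find(entity_id)


def block_key(name):
    """Crude blocking key: lowercase tokens, company suffixes dropped, sorted."""
    tokens = [t for t in name.lower().replace(",", " ").split() if t not in COMPANY_SUFFIXES]
    return " ".join(sorted(tokens))


def candidates(names, resolver):
    """Yield undecided ID pairs that share a blocking key."""
    blocks = defaultdict(list)
    for entity_id, name in names.items():
        blocks[block_key(name)].append(entity_id)
    for ids in blocks.values():
        ids = sorted(ids)
        for i, left in enumerate(ids):
            for right in ids[i + 1:]:
                if frozenset((left, right)) in resolver.negative:
                    continue
                if resolver.canonical(left) == resolver.canonical(right):
                    continue
                yield left, right
```

A new ingest would feed its names into `candidates` together with the names from past investigations; each pair a journalist confirms or rejects becomes an edge, and `canonical` then gives the shared ID to store on both records.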
3. Yente
- URL: https://github.com/opensanctions/yente
- One-liner: FastAPI service exposing entity search + bulk-match + Reconciliation-API over an FtM index (Elasticsearch/OpenSearch backend).
- License: MIT. Active: 2026-05-26, 133 stars.
- Lift: drop-in search/match API. If the new substrate stays FtM-shaped, yente gives us OpenRefine-compatible reconciliation for free.
- Covers: C4, C6, indirectly C3 (faceted browse via ES).
- Concerns: brings Elasticsearch as infra; for a single-journalist install that's heavy — but it's the simplest path to "search across all investigations". Run-alongside, not lift-code.
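As a rough idea of what "bulk-match" looks like from the client side, the sketch below builds (but does not send) a request for a locally running yente. The base URL, the `default` dataset name and the helper `build_match_request` are assumptions; the `queries` body shape follows the matching API as we understand it and should be checked against the yente documentation before use.

```python
import json
import urllib.request

# Assumption: a yente instance running locally on this address.
YENTE_URL = "http://localhost:8000"


def build_match_request(entities, dataset="default", base_url=YENTE_URL):
    """Build a POST request for a bulk match.

    `entities` maps a caller-chosen query key to (schema, properties),
    where properties is a dict of property name to list of string values.
    """
    queries = {
        key: {"schema": schema, "properties": props}
        for key, (schema, props) in entities.items()
    }
    body = json.dumps({"queries": queries}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/match/{dataset}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_match_request(
    {"q1": ("Company", {"name": ["Exempel Gruv AB"], "jurisdiction": ["se"]})}
)
```

Sending it would be a matter of `urllib.request.urlopen(request)` once a yente instance with an indexed dataset is actually running.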
4. vis-timeline
- URL: https://github.com/visjs/vis-timeline
- One-liner: Mature interactive timeline JS library — items, ranges, groups, zoomable.
- License: Apache-2.0 OR MIT (dual). Active: 2026-05-21, 2.5k stars.
- Lift: the renderer for C1. Plain library, no architectural lock-in. Pair with a Leaflet map and bbox filter for the "spatial filter by area, not point" requirement.
- Covers: C1 primarily.
- Concerns: none significant; vanilla JS, but Svelte/React wrappers are trivial.
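The "spatial filter by area, not point" requirement could be served by a small backend function that takes the map's bounding box and returns timeline items. A sketch under assumptions: `Event`, `in_bbox` and `timeline_items` are hypothetical names, the bbox order (west, south, east, north) is chosen to match Leaflet's `toBBoxString()` output, and the item keys `id`, `content` and `start` are the basic vis-timeline item fields.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    id: str
    title: str
    start: date
    lat: float
    lon: float


def in_bbox(event, west, south, east, north):
    """True when the event lies inside the box (edges inclusive)."""
    return west <= event.lon <= east and south <= event.lat <= north


def timeline_items(events, bbox):
    """Return vis-timeline style item dicts for events inside bbox, oldest first."""
    selected = [e for e in events if in_bbox(e, *bbox)]
    return [
        {"id": e.id, "content": e.title, "start": e.start.isoformat()}
        for e in sorted(selected, key=lambda e: e.start)
    ]


events = [
    Event("e1", "Bearbetningskoncession beviljad", date(2024, 3, 1), 67.85, 20.22),
    Event("e2", "Bolagsstämma", date(2023, 11, 15), 59.33, 18.07),
    Event("e3", "Provborrning inleds", date(2022, 6, 10), 67.13, 20.66),
]
norrbotten = (17.0, 65.0, 24.5, 69.1)  # illustrative box, not an official boundary
items = timeline_items(events, norrbotten)
```

The returned list can be serialised to JSON and handed to the timeline as its item set; redrawing on every map move keeps the two views in step.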
5. forensic-architecture/timemap
- URL: https://github.com/forensic-architecture/timemap
- One-liner: Reference frontend that already combines Leaflet map + d3 timeline + tag/category filters for incident exploration.
- License: Do No Harm (custom, derived from BSD-3); not OSI-approved, not strictly permissive. Active: latest commit dated 2025-06 per the API (recent enough), 377 stars.
- Lift: UI patterns and the time-space-tag triad as a layout reference. Lift visual/interaction design; rewrite the code rather than vendoring, given license ambiguity.
- Covers: C1 (gold-standard reference), C5 (events-as-first-class works for press articles too).
- Concerns: Do No Harm license is a yellow flag for any redistributable derivative; treat as inspiration, not dependency. Note from Pär: I have no problem with this license, and there is no need to redistribute.
6. memorious
- URL: https://github.com/alephdata/memorious
- One-liner: Distributed scraper framework (Celery + Python) used across the OCCRP stack for source-site crawling.
- License: MIT. Active: 2026-05-20, 315 stars.
- Lift: crawler/scheduler primitives if Pär wants to monitor sources (Bolagsverket pages, SGU registries, kommun protokoll). Direct fit for the "press articles as first-class entities" pipeline — point it at arebladet.se + competitors.
- Covers: C5 (ingest press), C6 (defers parsing).
- Concerns: opinionated around Celery; a smaller alternative is a plain httpx + apscheduler script if scale is one journalist.
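For the single-journalist case, the parsing half of such a script is small. The sketch below turns an RSS 2.0 document into FtM-shaped Article records using only the standard library; fetching and scheduling are left out. `articles_from_rss` is a hypothetical helper, and the property names (`title`, `sourceUrl`, `publisher`, `publishedAt`) are assumptions to verify against the FtM schema.

```python
import hashlib
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def articles_from_rss(feed_xml, publisher):
    """Convert RSS <item> elements into Article-shaped dicts.

    IDs are derived from the link, so re-ingesting a feed is idempotent.
    Items without a link are skipped.
    """
    root = ET.fromstring(feed_xml)
    entities = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        props = {
            "title": [(item.findtext("title") or "").strip()],
            "sourceUrl": [link],
            "publisher": [publisher],
        }
        published = item.findtext("pubDate")
        if published:
            # RSS dates are RFC 822 style; keep only the calendar date.
            props["publishedAt"] = [
                parsedate_to_datetime(published.strip()).date().isoformat()
            ]
        entities.append(
            {
                "id": "article-" + hashlib.sha1(link.encode("utf-8")).hexdigest(),
                "schema": "Article",
                "properties": props,
            }
        )
    return entities
```

Because the function takes the feed text as an argument, the fetch layer (httpx, urllib, or a memorious stage) can be swapped without touching the parsing.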
7. Label Studio
- URL: https://github.com/HumanSignal/label-studio
- One-liner: Self-hosted multi-format annotation UI with NER, relations, classification, PDF rendering, ML backends.
- License: Apache-2.0. Active: 2026-05-26, 27k stars.
- Lift: C2 in a box — the per-doc-type review UI for human-validated AI parses. Either run alongside (its labelling config maps cleanly to FtM properties) or lift its labelling-config DSL.
- Covers: C2 (direct), C6 (label-on-arrival is exactly the workflow).
- Concerns: heavy (Django + React + Postgres). For a one-journalist tool it's overkill; but the labelling-config XML schema is genuinely worth lifting even if we rebuild the UI.
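To illustrate how a labelling config could be derived from FtM schema names, the sketch below generates a minimal Label Studio NER config. The `View`, `Labels`, `Label` and `Text` tags are part of Label Studio's config language; the helper `labelling_config`, the control names `entity` and `doc`, and the choice of one label per schema are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def labelling_config(schemata, text_field="text"):
    """Build a Label Studio config with one span label per schema name.

    `text_field` is the task data key holding the document text.
    """
    view = ET.Element("View")
    labels = ET.SubElement(view, "Labels", name="entity", toName="doc")
    for schema in schemata:
        ET.SubElement(labels, "Label", value=schema)
    # The "$" prefix tells Label Studio to read the value from task data.
    ET.SubElement(view, "Text", name="doc", value=f"${text_field}")
    return ET.tostring(view, encoding="unicode")


config = labelling_config(["Person", "Company", "Asset"])
```

If the UI is rebuilt in Svelte instead, the same generated structure could serve as the per-doc-type description of which entity types a reviewer may assign.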
Tier 2 — Borrow patterns / vendor a subsystem
8. Aleph (alephdata/aleph)
- URL: https://github.com/alephdata/aleph
- One-liner: Document + structured-data search across investigations with entity cross-referencing.
- License: MIT. Status: official maintenance ended Dec 2025; OpenAleph (already on your list) is the soft fork.
- Lift: cross-reference UI patterns (profile API for "merge these two entities?"), the document-rendering pipeline. Most concrete bits are already in OpenAleph; revisit if a specific subsystem is cleaner upstream.
- Covers: C3, C4.
- Concerns: large, opinionated stack (ES + Postgres + Redis + RabbitMQ). Sunsetting branch — borrow code, don't deploy.
9. Paperless-ngx
- URL: https://github.com/paperless-ngx/paperless-ngx
- One-liner: Self-hosted doc management with OCR (Tesseract), auto-tagging, correspondents, types, web UI.
- License: GPL-3.0. Active: 2026-05-26, 41k stars.
- Lift: ingest watch-folder pattern, OCR pipeline wiring (tesseract + ocrmypdf glue), the "correspondent / type / tag" 3-axis model — a useful frame for C2 schemas.
- Covers: C2 (ingest UI), C3 (curated browse), C6.
- Concerns: GPL-3.0 means any direct code lift infects the new repo. Use as run-alongside ingest, or read for patterns. No FtM-style entity layer — it stops at tags.
10. OpenCTI
- URL: https://github.com/OpenCTI-Platform/opencti
- One-liner: Knowledge-hypergraph platform for cyber threat intelligence; STIX 2.1 schema; React+GraphQL+ElasticSearch.
- License: Apache-2.0 (Community Edition; EE is separate). Active: 2026-05-26, 9.4k stars.
- Lift: the Investigation UI (entity selection, graph expansion, knowledge-graph navigator) is a strong reference for C4. Frontend patterns lift cleanly; backend is heavy.
- Covers: C4 (architectural reference), C1 (it has time-aware entity views).
- Concerns: STIX schema is cyber-shaped (Indicator, Malware, AttackPattern) — wrong vocab for mining. Use the UX, not the schema. Big stack, foreign-domain assumptions.
11. INCEpTION
- URL: https://github.com/inception-project/inception
- One-liner: Semantic annotation platform with knowledge-base linking, recommenders, active learning.
- License: Apache-2.0. Active: 2026-05-26, 696 stars.
- Lift: KB-linking UI is the closest thing to "highlight a name, link to existing entity, or create new" that C2 + C4 need together. Strong adjudication/inter-annotator features.
- Covers: C2, C4 (entity linking).
- Concerns: Java/Spring stack — won't lift directly into a Python/JS substrate, but the UX is worth screenshotting and reimplementing.
12. Hoover (liquidinvestigations/hoover-search)
- URL: https://github.com/liquidinvestigations/hoover-search
- One-liner: Backend for searching large doc collections — Tika + ES — used by EIC for Football Leaks etc.
- License: MIT. Active: 2024-09 (going stale), 21 stars on this repo (project sprawls across siblings).
- Lift: the simpler-than-Aleph search backbone if a journalist wants ES-backed full-text without the FtM weight. Plays well as a microservice.
- Covers: C3.
- Concerns: stale upstream, niche community. Probably skip unless we specifically want a leaner Aleph.
13. doccano
- URL: https://github.com/doccano/doccano
- One-liner: Lightweight web annotation tool (text classification, NER, seq2seq).
- License: MIT. Active: 2026-04-14, 10.6k stars.
- Lift: simpler-than-Label-Studio alternative for C2 if Pär only needs NER and classification, not relations. Schema-to-UI mapping is clean Django/Vue.
- Covers: C2.
- Concerns: no PDF rendering, no relation annotation, no ML backend — Label Studio supersedes it for this use-case. Listed only as a fallback.
14. CJWorkbench
- URL: https://github.com/CJWorkbench/cjworkbench
- One-liner: Reproducible data-journalism pipeline modules (scrape → clean → analyse → publish).
- License: CC BY-NC-4.0 (non-commercial). Status: pushed 2024-12, effectively stale.
- Lift: pipeline-module pattern (each module declares params + transform) is worth a glance for C6's ingest-then-defer architecture. Don't vendor code — non-commercial license is a hard block even for internal tools you might later open-source.
- Concerns: license + staleness. Listed for completeness; skip.
Excluded after closer look
- PANO (ALW1EZ/PANO): CC BY-NC-4.0 — non-commercial, blocks lift. PySide6 desktop app; wrong shape.
- DocumentCloud (MuckRock fork): AGPL-3.0, Rails + Ember. Run-alongside only; UI not worth replicating.
- Khoj: AGPL-3.0, AI-second-brain frame doesn't model entities/timelines as first-class. Wrong abstraction.
- Dgraph / Memgraph / Cayley: graph DBs, not investigation tools. If we eventually need one, FtM serializes to RDF and any of them works — but no entity model, no ingest, no UI. Defer.
- Bellingcat toolkit: a curated list of tools, not a substrate. Useful index, nothing to lift.
Suggested composition
Build a Python+SQLite (or Postgres) + Svelte substrate. Lift:
- followthemoney schema (extend with mining-domain types: Mineral, Förekomst, Provborrhål).
- nomenklatura Resolver for cross-investigation entity matching (C4).
- vis-timeline + Leaflet for C1, with bbox spatial filter wired directly.
- Label Studio config DSL (lift idea, not code) for per-doc-type review (C2); render in Svelte against an FtM-typed payload.
- memorious patterns (or a 200-line equivalent) for ingesting Årebladet + competitor RSS as Article FtM entities (C5).
- Optionally run yente alongside for the search/reconcile API once corpus passes ~10k entities.
Everything load-bearing is MIT or Apache-2.0. No AGPL in the lift path. Forensic-Architecture's timemap is the design reference; the rest is plumbing.
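As a starting point for the Python+SQLite option, the storage layer could be as small as one table of FtM-shaped entities with a canonical ID column for the resolver's output. This is a sketch under assumptions: the table layout, the column names and the helpers `open_store`, `put` and `investigations_for` are all hypothetical, not taken from any of the projects above.

```python
import json
import sqlite3

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS entity (
    id TEXT PRIMARY KEY,
    ftm_schema TEXT NOT NULL,
    investigation TEXT NOT NULL,
    canonical_id TEXT NOT NULL,
    properties TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS entity_canonical ON entity (canonical_id);
"""


def open_store(path=":memory:"):
    """Open (or create) the entity store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA_SQL)
    return conn


def put(conn, entity_id, ftm_schema, investigation, properties, canonical_id=None):
    """Upsert an entity; canonical_id defaults to the entity's own ID."""
    conn.execute(
        "INSERT OR REPLACE INTO entity VALUES (?, ?, ?, ?, ?)",
        (
            entity_id,
            ftm_schema,
            investigation,
            canonical_id or entity_id,
            json.dumps(properties, ensure_ascii=False),
        ),
    )


def investigations_for(conn, canonical_id):
    """List investigations in which a resolved entity appears (C4)."""
    rows = conn.execute(
        "SELECT DISTINCT investigation FROM entity "
        "WHERE canonical_id = ? ORDER BY investigation",
        (canonical_id,),
    )
    return [row[0] for row in rows]
```

Keeping properties as a JSON blob defers schema decisions (C6), while the indexed `canonical_id` column is what makes the cross-investigation lookup cheap.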
Sources
- https://github.com/opensanctions/followthemoney
- https://github.com/opensanctions/nomenklatura
- https://github.com/opensanctions/yente
- https://github.com/alephdata/aleph
- https://github.com/alephdata/memorious
- https://github.com/visjs/vis-timeline
- https://github.com/forensic-architecture/timemap
- https://github.com/HumanSignal/label-studio
- https://github.com/doccano/doccano
- https://github.com/inception-project/inception
- https://github.com/paperless-ngx/paperless-ngx
- https://github.com/OpenCTI-Platform/opencti
- https://github.com/liquidinvestigations/hoover-search
- https://github.com/CJWorkbench/cjworkbench
- https://github.com/ALW1EZ/PANO
- https://tech.occrp.org/projects/
- https://gijn.org/resource/