OSS Investigation-Substrate Scan

Scanned 2026-05-26 for a Pär-grade journalist investigation substrate (successor to gruvor). Filter: permissive license preferred (MIT/BSD/Apache); AGPL OK as run-alongside; non-commercial / Do-No-Harm flagged as blockers for code lift. Already known and excluded: OpenAleph, Datashare, gitscrape.

Ranking is by lift-value (how directly we'd reuse code/schema in a new repo), not by general fame.

Tier 1 — Lift code directly

1. FollowTheMoney (opensanctions fork of alephdata/followthemoney)

URL: https://github.com/opensanctions/followthemoney (active fork; alephdata/followthemoney is the historical home)
One-liner: Python data model + ontology for investigative entities (Person, Company, Asset, Payment, Vessel, CourtCase, Document...).
License: MIT. Active: pushed 2026-05-22, 67 stars on the fork; the model itself is used far more widely than that count suggests (nomenklatura, yente and Aleph all build on it).
Lift: the entire schema. Thing → LegalEntity → Person/Company/Organization, properties with types (name, country, address, identifier, date), RDF/OWL serialization. Use as the entity table shape; map gruvor mining entities (Bolag, Gruva, Förekomst) onto Company/Asset/Location.
Covers: C4 (cross-investigation requires a shared entity vocab — FtM is it), C6 (schema lets "unknown" entities land as Thing and get refined).
Concerns: opinionated toward financial-crime use cases; Swedish place names and mineral-deposit nuance are not modelled, so extend with custom schemata. Schemata are defined in YAML files, which carries a slight learning curve.
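A minimal sketch of the mapping described under Lift, using only the standard library. It builds plain dicts in the id / schema / properties shape that FtM entities serialize to; it does not call the followthemoney package. The `to_ftm` helper, the ID scheme and the sample values are hypothetical, and the target schema names simply follow the mapping proposed above, so each should be checked against the installed FtM model.

```python
import hashlib

# Mapping proposed in the text; confirm each target schema exists in the
# installed FtM model before relying on it.
GRUVOR_TO_FTM = {"Bolag": "Company", "Gruva": "Asset", "Förekomst": "Location"}


def to_ftm(kind: str, key: str, props: dict[str, str]) -> dict:
    """Build an FtM-style entity dict from a gruvor record (hypothetical shape)."""
    # Unknown kinds land as Thing and can be refined later (C6).
    schema = GRUVOR_TO_FTM.get(kind, "Thing")
    # Hypothetical ID scheme: a stable hash of source, kind and natural key.
    entity_id = hashlib.sha1(f"gruvor:{kind}:{key}".encode("utf-8")).hexdigest()
    return {
        "id": entity_id,
        "schema": schema,
        # FtM properties are multi-valued, so each value is wrapped in a list.
        "properties": {name: [value] for name, value in props.items() if value},
    }


if __name__ == "__main__":
    # Invented example record, not real data.
    print(to_ftm("Bolag", "556000-0000", {"name": "Exempel Gruv AB", "country": "se"}))
```

Keeping the ID derivation deterministic means re-ingesting the same gruvor record yields the same entity ID, which is what lets later matching steps treat it as one node.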

2. Nomenklatura

URL: https://github.com/opensanctions/nomenklatura
One-liner: Entity deduplication and cross-dataset resolver built on FtM.
License: MIT. Active: 2026-05-22, 242 stars.
Lift: the Resolver graph (judgements as edges, canonical IDs), blocking, the matcher/scorer interfaces. This is precisely the engine C4 needs — given a new ingest, find candidate matches against the union of past investigations.
Covers: C4 (core), C6 (resolver tolerates partial entities).
Concerns: assumes you've already standardised on FtM (which makes #1 a prerequisite, not a complication).
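To make "judgements as edges, canonical IDs" concrete, here is a toy resolver built on union-find. It is an illustration of the idea only, not nomenklatura's API: the class name, method names and the "smallest ID wins" rule are all assumptions made for the sketch.

```python
class MiniResolver:
    """Toy resolver: positive judgements are edges, clusters share a canonical ID."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}
        self.negative: set[frozenset[str]] = set()

    def _find(self, entity_id: str) -> str:
        self.parent.setdefault(entity_id, entity_id)
        while self.parent[entity_id] != entity_id:
            # Path halving keeps later lookups short.
            self.parent[entity_id] = self.parent[self.parent[entity_id]]
            entity_id = self.parent[entity_id]
        return entity_id

    def decide(self, left: str, right: str, same: bool) -> None:
        """Record a human or machine judgement about a candidate pair."""
        if not same:
            # Negative judgements are only recorded here; a fuller version
            # would also refuse merges that contradict them.
            self.negative.add(frozenset((left, right)))
            return
        a, b = self._find(left), self._find(right)
        if a != b:
            # Smallest ID wins so the canonical ID is deterministic.
            keep, drop = sorted((a, b))
            self.parent[drop] = keep

    def canonical(self, entity_id: str) -> str:
        return self._find(entity_id)


if __name__ == "__main__":
    r = MiniResolver()
    r.decide("inv1-a", "inv2-b", same=True)
    r.decide("inv2-b", "inv3-c", same=True)
    print(r.canonical("inv3-c"))
```

The point of the pattern is that source entities are never rewritten: merging is a judgement stored beside them, so a wrong merge can be undone without re-ingesting anything.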

3. Yente

URL: https://github.com/opensanctions/yente
One-liner: FastAPI service exposing entity search + bulk-match + Reconciliation-API over an FtM index (Elasticsearch/OpenSearch backend).
License: MIT. Active: 2026-05-26, 133 stars.
Lift: drop-in search/match API. If the new substrate stays FtM-shaped, yente gives us OpenRefine-compatible reconciliation for free.
Covers: C4, C6, indirectly C3 (faceted browse via ES).
Concerns: brings Elasticsearch as infra; for a single-journalist install that's heavy — but it's the simplest path to "search across all investigations". Run-alongside, not lift-code.
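A sketch of what a bulk-match call against a run-alongside yente could look like. It only builds the URL and JSON body and sends nothing. The request shape (a "queries" object keyed by query ID, each holding a schema and properties) follows yente's documented match endpoint as I understand it; the base URL and the dataset name "default" are assumptions, so verify both against the deployed instance.

```python
import json


def build_match_request(base_url: str, dataset: str, entities: list[dict]) -> tuple[str, bytes]:
    """Return (url, body) for a bulk match of FtM-shaped entity dicts."""
    queries = {
        entity["id"]: {"schema": entity["schema"], "properties": entity["properties"]}
        for entity in entities
    }
    url = f"{base_url.rstrip('/')}/match/{dataset}"
    body = json.dumps({"queries": queries}).encode("utf-8")
    return url, body


if __name__ == "__main__":
    # Invented example entity; host and dataset are placeholders.
    candidate = {
        "id": "q1",
        "schema": "Company",
        "properties": {"name": ["Exempel Gruv AB"], "country": ["se"]},
    }
    url, body = build_match_request("http://localhost:8000", "default", [candidate])
    print(url)
    print(body.decode("utf-8"))
```

The body can be sent as a POST with a JSON content type using any HTTP client; keeping the builder separate makes it testable without a running index.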

4. vis-timeline

URL: https://github.com/visjs/vis-timeline
One-liner: Mature interactive timeline JS library — items, ranges, groups, zoomable.
License: Apache-2.0 OR MIT (dual). Active: 2026-05-21, 2.5k stars.
Lift: the renderer for C1. Plain library, no architectural lock-in. Pair with a Leaflet map and bbox filter for the "spatial filter by area, not point" requirement.
Covers: C1 primarily.
Concerns: none significant; vanilla JS, but Svelte/React wrappers are trivial.
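The "area, not point" filter can be sketched server-side as a bounding-box intersection test that then emits items in the id / content / start / end shape vis-timeline consumes. The event record layout (`bbox`, `title`, `start`, `end`) is a hypothetical one chosen for the example, and the test ignores boxes that cross the antimeridian, which should not matter for Swedish data.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch (no antimeridian handling)."""
    return (
        a.west <= b.east and b.west <= a.east
        and a.south <= b.north and b.south <= a.north
    )


def timeline_items(events: list[dict], area: BBox) -> list[dict]:
    """Keep events whose footprint overlaps the area; shape them as timeline items."""
    items = []
    for event in events:
        if not intersects(BBox(*event["bbox"]), area):
            continue
        item = {"id": event["id"], "content": event["title"], "start": event["start"]}
        if event.get("end"):
            item["end"] = event["end"]  # ranges get an end, point events do not
        items.append(item)
    return items


if __name__ == "__main__":
    # Invented coordinates and events for illustration.
    area = BBox(12.0, 63.0, 14.0, 64.0)
    events = [
        {"id": 1, "title": "Permit granted", "start": "2024-03-01", "bbox": (13.0, 63.3, 13.2, 63.5)},
        {"id": 2, "title": "Drilling", "start": "2024-05-01", "end": "2024-06-15", "bbox": (17.0, 59.0, 18.0, 60.0)},
    ]
    print(timeline_items(events, area))
```

Storing a box per event rather than a single coordinate is what allows a concession area or a municipality to match a map selection even when its centroid lies outside it.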

5. forensic-architecture/timemap

URL: https://github.com/forensic-architecture/timemap
One-liner: Reference frontend that already combines Leaflet map + d3 timeline + tag/category filters for incident exploration.
License: Do No Harm (custom, derived from BSD-3); not OSI-approved, not strictly permissive. Active: latest commit dated 2025-06 per the API (recent enough), 377 stars.
Lift: UI patterns and the time-space-tag triad as a layout reference. Lift visual/interaction design; rewrite the code rather than vendoring, given license ambiguity.
Covers: C1 (gold-standard reference), C5 (events-as-first-class works for press articles too).
Concerns: Do No Harm license is a yellow flag for any redistributable derivative; treat as inspiration, not dependency. Note from Pär: I have no problem with this license, and there is no need to redistribute.

6. memorious

URL: https://github.com/alephdata/memorious
One-liner: Distributed scraper framework (Celery + Python) used across the OCCRP stack for source-site crawling.
License: MIT. Active: 2026-05-20, 315 stars.
Lift: crawler/scheduler primitives if Pär wants to monitor sources (Bolagsverket pages, SGU registries, kommun protokoll). Direct fit for the "press articles as first-class entities" pipeline — point it at arebladet.se + competitors.
Covers: C5 (ingest press), C6 (defers parsing).
Concerns: opinionated around Celery; smaller alternative is a plain httpx + apscheduler script if scale is one journalist.
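For the single-journalist case, the parsing half of such a script can be sketched with the standard library alone: turn an RSS 2.0 document into Article-shaped entity dicts. Fetching and scheduling are left out. The function name, the ID scheme and the sample feed are hypothetical, and the property names (`title`, `sourceUrl`, `publisher`, `publishedAt`) are my reading of the FtM Article/Document schema, so confirm them against the installed model.

```python
import hashlib
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def rss_to_articles(rss_xml: str, publisher: str) -> list[dict]:
    """Convert RSS 2.0 <item> elements into FtM-style Article dicts."""
    articles = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # no stable URL means no stable ID, so skip the item
        props = {
            "title": [(item.findtext("title") or "").strip()],
            "sourceUrl": [link],
            "publisher": [publisher],
        }
        pub_date = item.findtext("pubDate")
        if pub_date:
            try:
                # RSS dates are RFC 822 style; store an ISO date instead.
                props["publishedAt"] = [parsedate_to_datetime(pub_date).date().isoformat()]
            except (TypeError, ValueError):
                pass  # leave the date out rather than guess
        articles.append({
            "id": hashlib.sha1(link.encode("utf-8")).hexdigest(),
            "schema": "Article",
            "properties": props,
        })
    return articles


if __name__ == "__main__":
    # Invented feed content for illustration.
    feed = (
        '<rss version="2.0"><channel><title>Feed</title>'
        "<item><title>Ny gruva planeras</title><link>https://example.org/a1</link>"
        "<pubDate>Tue, 26 May 2026 08:00:00 +0200</pubDate></item>"
        "</channel></rss>"
    )
    print(rss_to_articles(feed, "Example Paper"))
```

Hashing the article URL gives an ID that stays stable across polls, so a repeated fetch updates the same entity instead of creating duplicates.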

7. Label Studio

URL: https://github.com/HumanSignal/label-studio
One-liner: Self-hosted multi-format annotation UI with NER, relations, classification, PDF rendering, ML backends.
License: Apache-2.0. Active: 2026-05-26, 27k stars.
Lift: C2 in a box — the per-doc-type review UI for human-validated AI parses. Either run alongside (its labelling config maps cleanly to FtM properties) or lift its labelling-config DSL.
Covers: C2 (direct), C6 (label-on-arrival is exactly the workflow).
Concerns: heavy (Django + React + Postgres). For a one-journalist tool it's overkill; but the labelling-config XML schema is genuinely worth lifting even if we rebuild the UI.
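As a rough illustration of how labels could map onto FtM properties, the sketch below walks a list of span results in the shape Label Studio exports for text labelling (each result has a `type` and a `value` with `text` and `labels`) and groups the labelled text by schema and property. The label names and the mapping table are hypothetical and would come from the per-doc-type labelling config in practice.

```python
# Hypothetical label names and their FtM targets; adjust to the real config.
LABEL_TO_PROPERTY = {
    "BolagNamn": ("Company", "name"),
    "OrgNr": ("Company", "registrationNumber"),
    "PersonNamn": ("Person", "name"),
}


def results_to_properties(results: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Group labelled spans into {schema: {property: [values]}}."""
    out: dict[str, dict[str, list[str]]] = {}
    for result in results:
        if result.get("type") != "labels":
            continue  # skip choices, relations and other result types
        value = result.get("value", {})
        text = value.get("text", "").strip()
        for label in value.get("labels", []):
            target = LABEL_TO_PROPERTY.get(label)
            if target is None or not text:
                continue
            schema, prop = target
            values = out.setdefault(schema, {}).setdefault(prop, [])
            if text not in values:
                values.append(text)  # keep first-seen order, drop duplicates
    return out


if __name__ == "__main__":
    # Invented annotation results for illustration.
    sample = [
        {"type": "labels", "value": {"start": 0, "end": 15, "text": "Exempel Gruv AB", "labels": ["BolagNamn"]}},
        {"type": "labels", "value": {"start": 20, "end": 31, "text": "556000-0000", "labels": ["OrgNr"]}},
    ]
    print(results_to_properties(sample))
```

This flat grouping assumes one entity per schema per document; linking several companies in one document would need relation annotations or span grouping on top.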

Tier 2 — Borrow patterns / vendor a subsystem

8. Aleph (alephdata/aleph)

URL: https://github.com/alephdata/aleph
One-liner: Document + structured-data search across investigations with entity cross-referencing.
License: MIT. Status: official maintenance ended Dec 2025; OpenAleph (already on the known list) is the soft fork.
Lift: cross-reference UI patterns (profile API for "merge these two entities?"), the document-rendering pipeline. Most concrete bits are already in OpenAleph; revisit if a specific subsystem is cleaner upstream.
Covers: C3, C4.
Concerns: large, opinionated stack (ES + Postgres + Redis + RabbitMQ). Sunsetting branch — borrow code, don't deploy.

9. Paperless-ngx

URL: https://github.com/paperless-ngx/paperless-ngx
One-liner: Self-hosted doc management with OCR (Tesseract), auto-tagging, correspondents, types, web UI.
License: GPL-3.0. Active: 2026-05-26, 41k stars.
Lift: ingest watch-folder pattern, OCR pipeline wiring (tesseract + ocrmypdf glue), the "correspondent / type / tag" 3-axis model — a useful frame for C2 schemas.
Covers: C2 (ingest UI), C3 (curated browse), C6.
Concerns: GPL-3.0 is copyleft, so any direct code lift would put the new repo under GPL-3.0 as well. Use as run-alongside ingest, or read for patterns. No FtM-style entity layer; it stops at tags.

10. OpenCTI

URL: https://github.com/OpenCTI-Platform/opencti
One-liner: Knowledge-hypergraph platform for cyber threat intelligence; STIX 2.1 schema; React+GraphQL+Elasticsearch.
License: Apache-2.0 (Community Edition; EE is separate). Active: 2026-05-26, 9.4k stars.
Lift: the Investigation UI (entity selection, graph expansion, knowledge-graph navigator) is a strong reference for C4. Frontend patterns lift cleanly; backend is heavy.
Covers: C4 (architectural reference), C1 (it has time-aware entity views).
Concerns: STIX schema is cyber-shaped (Indicator, Malware, AttackPattern) — wrong vocab for mining. Use the UX, not the schema. Big stack, foreign-domain assumptions.

11. INCEpTION

URL: https://github.com/inception-project/inception
One-liner: Semantic annotation platform with knowledge-base linking, recommenders, active learning.
License: Apache-2.0. Active: 2026-05-26, 696 stars.
Lift: KB-linking UI is the closest thing to "highlight a name, link to existing entity, or create new" that C2 + C4 need together. Strong adjudication/inter-annotator features.
Covers: C2, C4 (entity linking).
Concerns: Java/Spring stack — won't lift directly into a Python/JS substrate, but the UX is worth screenshotting and reimplementing.

12. Hoover (liquidinvestigations/hoover-search)

URL: https://github.com/liquidinvestigations/hoover-search
One-liner: Backend for searching large doc collections — Tika + ES — used by EIC for Football Leaks etc.
License: MIT. Active: 2024-09 (going stale), 21 stars on this repo (project sprawls across siblings).
Lift: the simpler-than-Aleph search backbone if a journalist wants ES-backed full-text without the FtM weight. Plays well as a microservice.
Covers: C3.
Concerns: stale upstream, niche community. Probably skip unless we specifically want a leaner Aleph.

13. doccano

URL: https://github.com/doccano/doccano
One-liner: Lightweight web annotation tool (text classification, NER, seq2seq).
License: MIT. Active: 2026-04-14, 10.6k stars.
Lift: simpler-than-Label-Studio alternative for C2 if Pär only needs NER and classification, not relations. Schema-to-UI mapping is clean Django/Vue.
Covers: C2.
Concerns: no PDF rendering, no relation annotation, no ML backend — Label Studio supersedes it for this use-case. Listed only as a fallback.

14. CJWorkbench

URL: https://github.com/CJWorkbench/cjworkbench
One-liner: Reproducible data-journalism pipeline modules (scrape → clean → analyse → publish).
License: CC BY-NC-4.0 (non-commercial). Status: pushed 2024-12, effectively stale.
Lift: pipeline-module pattern (each module declares params + transform) is worth a glance for C6's ingest-then-defer architecture. Don't vendor code — non-commercial license is a hard block even for internal tools you might later open-source.
Concerns: license + staleness. Listed for completeness; skip.
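The pipeline-module pattern itself is small enough to rewrite from scratch. A minimal sketch, written independently of CJWorkbench's code: each module carries its own params and a transform over rows, and a runner applies them in order. All names here (`Module`, `run_pipeline`, `keep_where`, `rename`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]


@dataclass
class Module:
    """One pipeline step: a named transform plus the params it declares."""
    name: str
    transform: Callable[[Rows, dict], Rows]
    params: dict = field(default_factory=dict)


def run_pipeline(rows: Rows, modules: list[Module]) -> Rows:
    """Apply each module in order; every step receives the previous output."""
    for module in modules:
        rows = module.transform(rows, module.params)
    return rows


def keep_where(rows: Rows, params: dict) -> Rows:
    """Keep rows whose params['column'] equals params['equals']."""
    return [r for r in rows if r.get(params["column"]) == params["equals"]]


def rename(rows: Rows, params: dict) -> Rows:
    """Rename keys according to params['mapping']; other keys pass through."""
    mapping = params["mapping"]
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


if __name__ == "__main__":
    # Invented rows for illustration.
    raw = [{"kommun": "Åre", "namn": "A"}, {"kommun": "Kiruna", "namn": "B"}]
    steps = [
        Module("filter", keep_where, {"column": "kommun", "equals": "Åre"}),
        Module("rename", rename, {"mapping": {"namn": "name"}}),
    ]
    print(run_pipeline(raw, steps))
```

Because params are data rather than code, a pipeline can be stored alongside the ingest record and replayed later, which fits the ingest-then-defer idea.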

Excluded after closer look

PANO (ALW1EZ/PANO): CC BY-NC-4.0 — non-commercial, blocks lift. PySide6 desktop app; wrong shape.
DocumentCloud (MuckRock fork): AGPL-3.0, Rails + Ember. Run-alongside only; UI not worth replicating.
Khoj: AGPL-3.0, AI-second-brain frame doesn't model entities/timelines as first-class. Wrong abstraction.
Dgraph / Memgraph / Cayley: graph DBs, not investigation tools. If we eventually need one, FtM serializes to RDF and any of them works — but no entity model, no ingest, no UI. Defer.
Bellingcat toolkit: a curated list of tools, not a substrate. Useful index, nothing to lift.

Suggested composition

Build a Python+SQLite (or Postgres) + Svelte substrate. Lift:

followthemoney schema (extend with mining-domain types: Mineral, Förekomst, Provborrhål).
nomenklatura Resolver for cross-investigation entity matching (C4).
vis-timeline + Leaflet for C1, with bbox spatial filter wired directly.
Label Studio config DSL (lift idea, not code) for per-doc-type review C2 — render in Svelte against an FtM-typed payload.
memorious patterns (or a 200-line equivalent) for ingesting Årebladet + competitor RSS as Article FtM entities (C5).
Optionally run yente alongside for the search/reconcile API once corpus passes ~10k entities.

Everything load-bearing is MIT or Apache-2.0. No AGPL in the lift path. Forensic-Architecture's timemap is the design reference; the rest is plumbing.

Sources

https://github.com/opensanctions/followthemoney
https://github.com/opensanctions/nomenklatura
https://github.com/opensanctions/yente
https://github.com/alephdata/aleph
https://github.com/alephdata/memorious
https://github.com/visjs/vis-timeline
https://github.com/forensic-architecture/timemap
https://github.com/HumanSignal/label-studio
https://github.com/doccano/doccano
https://github.com/inception-project/inception
https://github.com/paperless-ngx/paperless-ngx
https://github.com/OpenCTI-Platform/opencti
https://github.com/liquidinvestigations/hoover-search
https://github.com/CJWorkbench/cjworkbench
https://github.com/ALW1EZ/PANO
https://tech.occrp.org/projects/
https://gijn.org/resource/