OSS Scan — Document Ingest, OCR, Viewer, Human-in-the-Loop Review UI

Target: an investigative substrate for one Swedish journalist (Årebladet), internal use, never deployed publicly. Lift-friendly licenses preferred (MIT/BSD/Apache); AGPL OK as run-alongside service. C2 = per-doc-type review UI for ingest. C3 = scoped curated browse with one-line explainers.

Category 1 — Annotation / Human-in-the-Loop Review UI (for C2)

1. Label Studio — https://github.com/HumanSignal/label-studio

License: Apache 2.0 (community edition); HumanSignal sells Enterprise on top
Active: Very. Dominant in the space.
Lift: XML-template-driven side-by-side layout (document on one pane, labeling form on another) is exactly the C2 pattern. The <Object>/<Control> template DSL is genuinely elegant for "doc + structured form."
Capability: C2
Concerns: Heavy. Postgres/Redis/Django stack — overkill for one journalist. The whole product assumes "training-set production for a model," not "validating Claude's output before a loader runs." Adapting it means fighting its ML-team mental model. Lift the template idea, don't deploy the app.
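
The template idea can be lifted without the app. A minimal sketch, assuming a hypothetical Python template format (the names `Field`, `ReviewTemplate`, `check_proposal`, and the `annual_report` doc type are illustrative; this is not Label Studio's XML DSL): one declaration per doc type names the document pane and the form fields, and both the form and the validation of a proposal follow from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One form control in the right-hand pane (hypothetical format)."""
    name: str
    kind: str            # "text", "number", or "choice"
    choices: tuple = ()  # only used when kind == "choice"


@dataclass(frozen=True)
class ReviewTemplate:
    """Declares what the reviewer sees for one doc type."""
    doc_type: str
    object_pane: str     # which artifact to show on the left, e.g. "pdf"
    fields: tuple


def check_proposal(template, proposal):
    """Return human-readable problems; an empty list means the form can render."""
    problems = []
    for f in template.fields:
        if f.name not in proposal:
            problems.append(f"missing field: {f.name}")
        elif f.kind == "number" and not isinstance(proposal[f.name], (int, float)):
            problems.append(f"{f.name} should be a number")
        elif f.kind == "choice" and proposal[f.name] not in f.choices:
            problems.append(f"{f.name} not in {list(f.choices)}")
    known = {f.name for f in template.fields}
    for extra in sorted(set(proposal) - known):
        problems.append(f"unexpected field: {extra}")
    return problems


# Illustrative doc type; the field names are assumptions.
ANNUAL_REPORT = ReviewTemplate(
    doc_type="annual_report",
    object_pane="pdf",
    fields=(
        Field("company_name", "text"),
        Field("fiscal_year", "number"),
        Field("filing_status", "choice", ("filed", "late", "missing")),
    ),
)
```

Adding a new doc type then means adding one more `ReviewTemplate`, not a new screen.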

2. Doccano — https://github.com/doccano/doccano

License: MIT
Active: Maintained but slower velocity than Label Studio. Django + Vue.
Lift: Reference for minimal annotation UI; clean Django code to read. Project/dataset/label-type abstractions are sane.
Capability: C2
Concerns: Text-only. No PDF viewer — requires external OCR preprocessing. Useless for "show PDF next to extracted JSON" without surgery. License (MIT) is the main thing going for it as a code-lift target.

3. Argilla — https://github.com/argilla-io/argilla

License: Apache 2.0
Active: Yes, but acquired by Hugging Face (~$10M, mid-2024). Roadmap now tilts toward HF dataset workflows.
Lift: Their "record + suggestion + response" data model maps cleanly onto Claude-proposes / Pär-validates. SDK-first design (Python defines schema, UI auto-renders) is the right shape for "Claude generates the loader script and schema, UI follows."
Capability: C2
Concerns: Tight HF ecosystem coupling (Datasets, Spaces). Foreign assumption: every record is a row in an HF dataset. Bending it to "this PDF + this JSON proposal + this loader script preview" is plausible but non-trivial.
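
The "record + suggestion + response" shape is small enough to restate without the SDK. A sketch under assumptions: the class and method names below are hypothetical and only mirror the idea, not Argilla's actual API. The point is that the model's proposal is never overwritten, and only reviewed values reach the loader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """What the model proposes for a record (here: Claude's extracted JSON)."""
    values: dict
    source: str = "claude"


@dataclass
class Response:
    """What the human decided after looking at the document."""
    values: dict
    status: str  # "accepted", "corrected", or "rejected"


@dataclass
class Record:
    """One document under review."""
    doc_path: str
    suggestion: Suggestion
    response: Optional[Response] = None

    def review(self, corrected=None, reject=False):
        """Store the reviewer's decision; the suggestion stays untouched."""
        if reject:
            self.response = Response(values={}, status="rejected")
        elif corrected is None or corrected == self.suggestion.values:
            self.response = Response(values=dict(self.suggestion.values), status="accepted")
        else:
            self.response = Response(values=corrected, status="corrected")

    def loadable(self):
        """Only reviewed, non-rejected values should reach the loader."""
        if self.response is None or self.response.status == "rejected":
            return None
        return self.response.values
```

Keeping suggestion and response side by side also gives a free audit trail of how often the proposal needed correcting.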

4. INCEpTION (UKP / TU Darmstadt) — https://inception-project.github.io/

License: Apache 2.0
Active: Yes, continuous academic development since ~2018; a © 2026 notice is still visible on the project site.
Lift: Best-in-class active learning + recommender architecture (model proposes, human corrects, model re-trains). Document-centric, not sentence-centric.
Capability: C2
Concerns: Java/Spring Boot monolith. Academic codebase — readable but not idiomatic for a Python/JS shop. Foreign assumption: linguistic annotation (POS, NER, coreference). Reading the architecture is more valuable than running it.

5. Kiln AI — https://github.com/Kiln-AI/Kiln

License: Core library + REST server MIT; desktop app source-available "fair-code"
Active: Yes, growing fast in 2025–26.
Lift: The "human rates LLM output, dataset accumulates, fine-tune later" loop is closer to C2's intent than any classical annotation tool. Synthetic data + eval + dataset management in one product.
Capability: C2
Concerns: Desktop-app license is not OSI-open — keep the lift to the MIT library/server. Built for LLM-eval workflows, not document-ingest-validation; the "doc-on-the-left" affordance is missing and would need building.

6. Refinery (Kern AI) — https://github.com/code-kern-ai/refinery

License: Apache 2.0
Active: Slowing. Kern AI pivoted product focus around 2024; check commit cadence before committing.
Lift: Heuristic/weak-supervision-first design; lets a labeling function (read: Claude's proposal) populate suggestions in bulk, so the human spot-checks a sample rather than labeling record by record.
Capability: C2
Concerns: Postgres + 6+ microservices. Stale-ish. Worth reading the data model, probably not worth running.
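
The spot-check idea is independent of Refinery's stack. A minimal sketch, assuming a hypothetical helper `pick_spot_checks` and an arbitrary 10% default rate: a seeded sample picks which bulk suggestions a human looks at, and anything flagged some other way is always included.

```python
import math
import random


def pick_spot_checks(record_ids, rate=0.1, seed=0, always_check=()):
    """Return the ids a human should review; the rest are accepted in bulk.

    always_check holds ids flagged some other way (e.g. low confidence).
    """
    forced = set(always_check)
    pool = [r for r in record_ids if r not in forced]
    sample_size = min(len(pool), math.ceil(len(pool) * rate))
    rng = random.Random(seed)  # seeded so the same batch gives the same sample
    sampled = set(rng.sample(pool, sample_size))
    # Preserve the original ordering of the batch.
    return [r for r in record_ids if r in forced or r in sampled]
```

The seed makes the review list reproducible, which matters if the batch is re-run after a loader fix.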

Category 2 — Document Parsing / Structured Extraction (for C2)

7. Docling (IBM Research) — https://github.com/docling-project/docling

License: MIT
Active: Very. Now an LF AI & Data project; one of the most active doc-parsing repos in 2026.
Lift: DoclingDocument schema preserves semantic hierarchy (sections, tables, figures with reading order). That schema is itself a useful target for C3's "split massive PDFs per section, backlink to original."
Capability: C2 and C3
Concerns: Runs every page through ML models — wasteful on clean digital text. Slower than PyMuPDF for trivial PDFs. Best paired with a fast-path fallback.
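
One way to get that fast path, sketched under assumptions: check whether most pages already carry a real text layer and only call the ML parser when they do not. The thresholds are guesses to tune, and `extract_page_texts` / `heavy_parse` are hypothetical injected wrappers (for example around PyMuPDF and Docling), so nothing third-party is imported here.

```python
def has_usable_text_layer(page_texts, min_chars_per_page=200, min_good_fraction=0.8):
    """Heuristic: treat the PDF as digital-native if most pages carry real text."""
    if not page_texts:
        return False
    good = sum(1 for t in page_texts if len(t.strip()) >= min_chars_per_page)
    return good / len(page_texts) >= min_good_fraction


def parse_pdf(path, extract_page_texts, heavy_parse):
    """Try the cheap text layer first; fall back to the ML parser.

    extract_page_texts(path) -> list of per-page strings (cheap).
    heavy_parse(path) -> whatever the ML parser returns (slow).
    """
    page_texts = extract_page_texts(path)
    if has_usable_text_layer(page_texts):
        return {"path": path, "route": "fast", "pages": page_texts}
    return {"path": path, "route": "heavy", "result": heavy_parse(path)}
```

The fast route gives up table and layout structure, so it only suits doc types where plain text is enough.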

8. Marker — https://github.com/VikParuchuri/marker

License: GPL-3.0 (commercial license required for SaaS use — irrelevant here since internal-only)
Active: Very.
Lift: Fast PDF→Markdown with reasonable table/equation handling. Mac MPS support — runs locally on Apple Silicon, no GPU box needed.
Capability: C2
Concerns: GPL-3 is contagious: code lifted into a permissive project would pull that project under the GPL. Run it as a subprocess/service, don't import it. Quality trails Docling on tricky layouts per 2026 benchmarks (Marker 0.861 vs Docling 0.877).
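
Running the tool at arm's length can look like the sketch below. Assumptions: the converter takes an input path and an output directory as its last two arguments and writes `.md` files; Marker's real CLI may differ, so `command` is left as a parameter to fill in from its docs. `run_converter` is a hypothetical helper name.

```python
import subprocess
from pathlib import Path


def run_converter(command, pdf_path, out_dir, timeout_s=600):
    """Run an external converter as its own process and collect its output files.

    command is the argv prefix of the tool; pdf_path and out_dir are appended.
    The exact CLI of any real tool is an assumption to verify against its docs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    proc = subprocess.run(
        [*command, str(pdf_path), str(out_dir)],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"converter failed ({proc.returncode}): {proc.stderr.strip()}")
    return sorted(out_dir.glob("*.md"))
```

Only files cross the boundary, so the calling code never imports the GPL package.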

9. MinerU (OpenDataLab) — https://github.com/opendatalab/MinerU

License: AGPL-3.0 (run-alongside is fine per Pär's brief)
Active: Very. Highest GitHub stars in the category.
Lift: Strong layout detection; explicit JSON output for agentic pipelines.
Capability: C2
Concerns: Tuned for CJK content first — header/footer cleanup logic may be over-eager on Swedish docs. AGPL forces network-service isolation. Heavier than Marker; probably not the right default for one journalist.

10. olmOCR (AllenAI) — https://github.com/allenai/olmocr

License: Apache 2.0 (toolkit); model weights also openly licensed
Active: Very. olmOCR 2 (Oct 2025), v0.4.x ongoing.
Lift: End-to-end VLM that handles scanned/messy PDFs, handwriting, and equations in one pass. State of the art for English print. Bolagsverket/Lantmäteriet scans are exactly the use case.
Capability: C2
Concerns: VLM inference cost — needs a GPU or remote call. Overkill for digital-native PDFs. Best as the OCR fallback when Docling/Marker fail.

11. Apache Tika — https://tika.apache.org/

License: Apache 2.0
Active: Yes. 3.3.1 stable (Mar 2026), 4.0 alpha out.
Lift: The Swiss army knife — 1000+ formats, including the weird ones (Office, RTF, archives, emails, .msg). For "what is this file, and give me text + metadata," it's still the floor.
Capability: C2
Concerns: Java service. Text extraction only — no layout, no tables-as-structure. Use it as a format-detection + cheap-text first pass, not as the structured-extraction layer.
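
A first pass against a Tika server needs only the standard library. A sketch, assuming a server is already running locally on its default port 9998; it uses the server's `/detect/stream` and `/tika` endpoints via PUT. `first_pass`, `tika_request`, and the `send` parameter (there so the network call can be swapped out) are hypothetical names.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # assumption: local Tika server, default port


def tika_request(endpoint, data, accept):
    """Build (but do not send) a PUT request to a Tika server endpoint."""
    return urllib.request.Request(
        f"{TIKA_URL}{endpoint}", data=data, method="PUT", headers={"Accept": accept}
    )


def first_pass(path, send=urllib.request.urlopen):
    """Return the detected MIME type and plain text for one file."""
    with open(path, "rb") as fh:
        data = fh.read()
    # Format detection: the response body is the MIME type as text.
    with send(tika_request("/detect/stream", data, "text/plain")) as resp:
        mime = resp.read().decode("utf-8").strip()
    # Cheap text extraction: no layout, no table structure.
    with send(tika_request("/tika", data, "text/plain")) as resp:
        text = resp.read().decode("utf-8")
    return {"path": path, "mime": mime, "text": text}
```

The MIME type is what decides which structured parser, if any, sees the file next.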

12. GROBID — https://github.com/kermitt2/grobid

License: Apache 2.0
Active: Yes (maintainer transitioned from Patrice Lopez to Luca Foppiano; Inria-backed). Used by Semantic Scholar, ResearchGate, CERN.
Lift: Only worth it if Pär's corpus includes academic papers (mining-history theses?). Otherwise foreign-assumption-heavy.
Capability: C2 — narrow
Concerns: Scholarly-paper-shaped. Bolagsverket filings and Lantmäteriet maps look nothing like an arXiv preprint. Skip unless academic docs show up.

13. unstructured.io — https://github.com/Unstructured-IO/unstructured

License: Apache 2.0 (library); paid Platform on top
Active: Yes, but increasingly steering users to the hosted Platform; OSS partitioners are sometimes thinner than the SaaS equivalents.
Lift: 60+ format partitioners as a polished Python API. Convenient for prototyping the ingest pipeline before specializing per doc-type.
Capability: C2
Concerns: Open-core tension — best parsers may live behind the paywall over time. Output quality on complex layouts trails Docling.

Category 3 — Document Viewers (for C3)

14. PDF.js (Mozilla) — https://github.com/mozilla/pdf.js

License: Apache 2.0
Active: Very.
Lift: The default. Renders in browser, handles annotations for viewing. Pairs with react-pdf-highlighter (MIT) for selection/highlight overlays.
Capability: C3 and C2 (the doc pane in side-by-side review)
Concerns: Can render existing highlights but persistence/save is limited; you store annotations in your own DB keyed by page+coords, not in the PDF. For C3 this is fine.
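
Storing highlights outside the PDF can be as small as one SQLite table. A minimal sketch; the table layout and the choice to store boxes as fractions of page width and height (so they survive zoom changes) are assumptions, not anything PDF.js prescribes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS highlight (
    id      INTEGER PRIMARY KEY,
    doc_id  TEXT NOT NULL,
    page    INTEGER NOT NULL,            -- 1-based page number
    x0 REAL NOT NULL, y0 REAL NOT NULL,
    x1 REAL NOT NULL, y1 REAL NOT NULL,  -- box as fractions of page size
    note    TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS highlight_by_page ON highlight (doc_id, page);
"""


def open_store(path=":memory:"):
    """Open (or create) the annotation store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_highlight(conn, doc_id, page, box, note=""):
    """Store one highlight; box is (x0, y0, x1, y1) in 0..1 page fractions."""
    if not all(0.0 <= v <= 1.0 for v in box):
        raise ValueError("box coordinates must be page fractions in [0, 1]")
    cur = conn.execute(
        "INSERT INTO highlight (doc_id, page, x0, y0, x1, y1, note) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (doc_id, page, *box, note),
    )
    conn.commit()
    return cur.lastrowid


def highlights_for_page(conn, doc_id, page):
    """Everything the viewer needs to redraw overlays on one page."""
    rows = conn.execute(
        "SELECT x0, y0, x1, y1, note FROM highlight "
        "WHERE doc_id = ? AND page = ? ORDER BY id",
        (doc_id, page),
    )
    return [{"box": r[:4], "note": r[4]} for r in rows]
```

The same rows double as backlinks: a stored (doc, page, box) is enough to jump from an extracted field back to its source passage.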

15. react-pdf-highlighter — https://github.com/agentcooper/react-pdf-highlighter

License: MIT
Active: Yes, v8 rc as of 2025–26.
Lift: Drop-in React component for selection → highlight → comment, built on PDF.js. This is the closest thing to the side-by-side C2 affordance you can lift without writing it yourself.
Capability: C2 (primary), C3 (highlights = backlinks)
Concerns: React-only. Solo-maintainer project — read the code before depending heavily.

16. Datashare (ICIJ) — https://github.com/ICIJ/datashare

License: AGPL-3.0
Active: Yes — significant 2025 redesign.
Lift: Tika + Tesseract + Elasticsearch + Vue 3 viewer, battle-tested on Panama-Papers-scale corpora. The doc viewer + faceted browse is closer to C3's intent than anything else in this scan. Their datashare-preview subproject is a standalone document-preview server.
Capability: C3 (primary), C2 partly
Concerns: AGPL — run as service, don't import. Elasticsearch dependency is real ops weight. Search-first product, not browse-first; the "N docs with one-line explainers" view would need building on top.

17. OpenAleph / Aleph (DARC, formerly OCCRP) — https://github.com/openaleph/openaleph

License: MIT (Aleph historically; OpenAleph fork inherits)
Active: Yes — DARC actively developing post-fork (2024–26). OCCRP's own variant went commercial ("Aleph Pro"); OpenAleph is the community continuation.
Lift: Investigative-journalism-shaped from day one. Entity extraction, cross-doc linking, "all docs about company X" is literally a built-in view (FollowTheMoney, or FtM, schema). This is the closest match to C3's brief in the whole scan.
Capability: C3 (very strong), C2 (weak — no review-UI affordance)
Concerns: Heavy stack (Postgres + Elasticsearch + Redis + workers + Aleph-specific FtM ontology). The FollowTheMoney schema is opinionated: adopting it means bending Pär's data model to theirs. But for the C3 capability alone, this is the strongest reference architecture.

18. Calibre-Web — https://github.com/janeczku/calibre-web

License: GPL-3.0
Active: Yes (17k+ stars). Also see Calibre-Web-Automated fork.
Lift: "Show me my N documents with summaries and metadata" browse UI for a personal library. Series/tags/custom-columns → maps onto Pär's "all Bolagsverket docs for company X."
Capability: C3
Concerns: Ebook-shaped (EPUB/MOBI assumptions, ISBN/author/series metadata). Bending it to "company filings" means fighting the schema. GPL contagion if lifting code; fine as inspiration for the browse-grid layout.
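
The browse view itself is little more than a grouped, sorted listing. A sketch of "all docs for company X with one-line explainers", assuming a hypothetical flat index whose keys (`company`, `source`, `date`, `title`, `explainer`, `path`) are invented for illustration.

```python
from collections import defaultdict


def browse_by_company(index, company):
    """Return display lines for one company, grouped by source, oldest first.

    index is a list of dicts with hypothetical keys:
    company, source, date (ISO string), title, explainer, path.
    """
    groups = defaultdict(list)
    for doc in index:
        if doc["company"] == company:
            groups[doc["source"]].append(doc)
    lines = []
    for source in sorted(groups):
        lines.append(f"{source} ({len(groups[source])})")
        # ISO dates sort correctly as plain strings.
        for doc in sorted(groups[source], key=lambda d: d["date"]):
            lines.append(f"  {doc['date']}  {doc['title']}: {doc['explainer']}")
    return lines
```

A real UI would render these as a grid with links to `path`, but the grouping logic stays the same.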

Honest summary

For C2: Don't deploy any of these. Lift the Label Studio template DSL idea + Argilla SDK-first data model + react-pdf-highlighter for the doc pane. Build the actual UI custom — it's two screens.
For C2 parsing: Docling first, Marker as fallback, olmOCR for scans, Tika as the format-sniffer floor. Skip GROBID unless academic docs appear.
For C3: OpenAleph is the reference architecture even if not deployed. Read their FtM data model and Vue viewer. Lift the layout, build your own.
Skeptic note: Every tool here was built for a different user (ML teams, scholarly NLP, big-corpus investigative units of 50+ people). One Swedish journalist's tool is closer to a bespoke Svelte app reading a few Docling JSONs than to any of the above. The lift is patterns and a few components, not platforms.
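
The parsing order in the summary (Docling first, Marker as fallback, olmOCR for scans, Tika as the floor) can be sketched as a simple fallback chain. Assumptions: each parser is a hypothetical callable wrapper taking a path and returning text, scan detection happens elsewhere and arrives as `is_scan`, and using Tika's plain text as the last resort is one reading of "floor".

```python
def parse_with_fallbacks(path, is_scan, parsers):
    """Try parsers in preference order and report which one produced the result.

    parsers maps a name to a callable(path) -> str. The callables are
    hypothetical wrappers; none of the real tools is imported here.
    """
    order = ["olmocr"] if is_scan else ["docling", "marker"]
    order.append("tika")  # plain-text floor, tried last in both cases
    errors = {}
    for name in order:
        try:
            text = parsers[name](path)
        except Exception as exc:  # a failed parser should not end the chain
            errors[name] = repr(exc)
            continue
        if text and text.strip():
            return {"parser": name, "text": text, "errors": errors}
        errors[name] = "empty output"
    raise RuntimeError(f"all parsers failed for {path}: {errors}")
```

Returning the parser name and the collected errors lets the review UI show which route a document took.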