Pär, a Swedish journalist at Årebladet, needs to validate Claude's document extractions before they are loaded. No off-the-shelf platform fits. The recommendation is a small custom UI that borrows Label Studio's template DSL, Argilla's SDK-first data model, and react-pdf-highlighter, rather than fighting tools built around ML-team mental models.
OSS Scan — Document Ingest, OCR, Viewer, Human-in-the-Loop Review UI
Target: an investigative substrate for one Swedish journalist (Årebladet), internal use, never deployed publicly. Lift-friendly licenses preferred (MIT/BSD/Apache); AGPL OK as run-alongside service. C2 = per-doc-type review UI for ingest. C3 = scoped curated browse with one-line explainers.
Category 1 — Annotation / Human-in-the-Loop Review UI (for C2)
1. Label Studio — https://github.com/HumanSignal/label-studio
- License: Apache 2.0 (community edition); HumanSignal sells Enterprise on top
- Active: Very. Dominant in the space.
- Lift: XML-template-driven side-by-side layout (document on one pane, labeling form on another) is exactly the C2 pattern. The <Object>/<Control> template DSL is genuinely elegant for "doc + structured form."
- Capability: C2
- Concerns: Heavy. Postgres/Redis/Django stack — overkill for one journalist. The whole product assumes "training-set production for a model," not "validating Claude's output before a loader runs." Adapting it means fighting its ML-team mental model. Lift the template idea, don't deploy the app.
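The "lift the template idea" point can be sketched without Label Studio at all. The following is a minimal illustration, not Label Studio's API: ReviewTemplate, Control, check_proposal, and the field names are all hypothetical. One template per doc type names the document to show (the object) and the form fields to validate (the controls), and a proposal from Claude is checked against it before Pär sees it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Control:
    """One form field in the review pane (hypothetical, not a Label Studio tag)."""
    name: str
    kind: str  # "text", "number", or "choice"
    choices: tuple[str, ...] = ()
    required: bool = True


@dataclass(frozen=True)
class ReviewTemplate:
    """Pairs one document pane (the object) with a set of form controls."""
    doc_type: str
    object_source: str  # key in the task that holds the document to display
    controls: tuple[Control, ...]


def check_proposal(template: ReviewTemplate, proposal: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the form can render."""
    problems: list[str] = []
    for control in template.controls:
        value = proposal.get(control.name)
        if value is None:
            if control.required:
                problems.append(f"missing: {control.name}")
            continue
        if control.kind == "number" and not isinstance(value, (int, float)):
            problems.append(f"{control.name}: expected a number")
        if control.kind == "choice" and value not in control.choices:
            problems.append(f"{control.name}: {value!r} not an allowed choice")
    known = {control.name for control in template.controls}
    for extra in sorted(set(proposal) - known):
        problems.append(f"unexpected: {extra}")
    return problems
```

A new doc type then costs one ReviewTemplate declaration rather than a new screen, which is the property worth borrowing from the template DSL.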
2. Doccano — https://github.com/doccano/doccano
- License: MIT
- Active: Maintained but slower velocity than Label Studio. Django + Vue.
- Lift: Reference for minimal annotation UI; clean Django code to read. Project/dataset/label-type abstractions are sane.
- Capability: C2
- Concerns: Text-only. No PDF viewer — requires external OCR preprocessing. Useless for "show PDF next to extracted JSON" without surgery. License (MIT) is the main thing going for it as a code-lift target.
3. Argilla — https://github.com/argilla-io/argilla
- License: Apache 2.0
- Active: Yes, but acquired by Hugging Face (~$10M, mid-2024). Roadmap now tilted toward HF dataset workflows.
- Lift: Their "record + suggestion + response" data model maps cleanly onto Claude-proposes / Pär-validates. SDK-first design (Python defines schema, UI auto-renders) is the right shape for "Claude generates the loader script and schema, UI follows."
- Capability: C2
- Concerns: Tight HF ecosystem coupling (Datasets, Spaces). Foreign assumption: every record is a row in an HF dataset. Bending it to "this PDF + this JSON proposal + this loader script preview" is plausible but non-trivial.
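The record + suggestion + response shape is small enough to restate in plain Python. This sketch mirrors the idea only and does not import or reproduce Argilla's SDK; the class fields, the status values, and the two helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Suggestion:
    """A value proposed by a model for one question (here: Claude)."""
    question: str
    value: Any
    agent: str = "claude"


@dataclass
class Response:
    """A value a human entered or confirmed for one question."""
    question: str
    value: Any
    status: str = "submitted"  # assumed statuses: "submitted" or "discarded"


@dataclass
class Record:
    """One document under review, with proposals and human answers."""
    source_pdf: str
    suggestions: list[Suggestion] = field(default_factory=list)
    responses: list[Response] = field(default_factory=list)


def validated_values(record: Record) -> dict[str, Any]:
    """Only human-submitted values are eligible for the loader."""
    return {r.question: r.value for r in record.responses if r.status == "submitted"}


def pending_questions(record: Record) -> list[str]:
    """Questions Claude proposed a value for that Pär has not yet answered."""
    answered = {r.question for r in record.responses}
    return [s.question for s in record.suggestions if s.question not in answered]
```

Keeping suggestions and responses as separate lists means a loader can refuse to run while pending_questions is non-empty, and Claude's original proposal stays available for later comparison.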
4. INCEpTION (UKP / TU Darmstadt) — https://inception-project.github.io/
- License: Apache 2.0
- Active: Yes, continuous academic development since ~2018; the project site still carries a 2026 copyright notice.
- Lift: Best-in-class active learning + recommender architecture (model proposes, human corrects, model re-trains). Document-centric, not sentence-centric.
- Capability: C2
- Concerns: Java/Spring Boot monolith. Academic codebase — readable but not idiomatic for a Python/JS shop. Foreign assumption: linguistic annotation (POS, NER, coreference). Reading the architecture is more valuable than running it.
5. Kiln AI — https://github.com/Kiln-AI/Kiln
- License: Core library + REST server MIT; desktop app source-available "fair-code"
- Active: Yes, growing fast in 2025–26.
- Lift: The "human rates LLM output, dataset accumulates, fine-tune later" loop is closer to C2's intent than any classical annotation tool. Synthetic data + eval + dataset management in one product.
- Capability: C2
- Concerns: Desktop-app license is not OSI-open — keep the lift to the MIT library/server. Built for LLM-eval workflows, not document-ingest-validation; the "doc-on-the-left" affordance is missing and would need building.
6. Refinery (Kern AI) — https://github.com/code-kern-ai/refinery
- License: Apache 2.0
- Active: Slowing. Kern AI pivoted product focus around 2024; check commit cadence before committing.
- Lift: Heuristic/weak-supervision-first design; lets a labeling function (read: Claude's proposal) populate suggestions in bulk, so a human spot-checks rather than labeling record by record.
- Capability: C2
- Concerns: Postgres + 6+ microservices. Stale-ish. Worth reading the data model, probably not worth running.
Category 2 — Document Parsing / Structured Extraction (for C2)
7. Docling (IBM Research) — https://github.com/docling-project/docling
- License: MIT
- Active: Very. Now an LF AI & Data project; one of the most active doc-parsing repos in 2026.
- Lift: DoclingDocument schema preserves semantic hierarchy (sections, tables, figures with reading order). That schema is itself a useful target for C3's "split massive PDFs per section, backlink to original."
- Capability: C2 and C3
- Concerns: Runs every page through ML models — wasteful on clean digital text. Slower than PyMuPDF for trivial PDFs. Best paired with a fast-path fallback.
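The fast-path fallback mentioned above could be decided with a cheap heuristic before any ML model runs. This is a sketch under stated assumptions: page_texts is the per-page text from a fast extractor such as PyMuPDF, and the function name and both thresholds are hypothetical values that would need tuning on real filings.

```python
def choose_parser(
    page_texts: list[str],
    min_chars_per_page: int = 200,      # assumed threshold, tune on real docs
    min_text_page_ratio: float = 0.9,   # assumed threshold, tune on real docs
) -> str:
    """Return "fast" for clean digital PDFs, "docling" for everything else.

    A page counts as having a usable text layer when the fast extractor
    returned at least min_chars_per_page non-whitespace-trimmed characters.
    """
    if not page_texts:
        return "docling"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "fast" if ratio >= min_text_page_ratio else "docling"
```

The heuristic only checks for a text layer; a digital PDF with complex tables might still deserve the Docling route, so a per-doc-type override is a likely extension.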
8. Marker — https://github.com/VikParuchuri/marker
- License: GPL-3.0 (commercial license required for SaaS use — irrelevant here since internal-only)
- Active: Very.
- Lift: Fast PDF→Markdown with reasonable table/equation handling. Mac MPS support — runs locally on Apple Silicon, no GPU box needed.
- Capability: C2
- Concerns: GPL-3.0 is copyleft, so code lifted into a permissively licensed project would pull that project under the GPL. Run it as a subprocess or service; don't import it. Quality trails Docling on tricky layouts per 2026 benchmarks (0.861 vs 0.877).
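Running the converter as a subprocess rather than importing it might look like the sketch below. The marker_single executable name and its --output_dir flag are assumptions about the installed Marker version and should be checked against its CLI help; marker_command and run_isolated are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def marker_command(pdf: Path, out_dir: Path, executable: str = "marker_single") -> list[str]:
    """Build the argv for one conversion (CLI name and flag are assumptions)."""
    return [executable, str(pdf), "--output_dir", str(out_dir)]


def run_isolated(command: list[str], timeout_s: int = 600) -> str:
    """Run an external tool in its own process and return its stdout.

    No Python import of the tool happens here, so none of its code is
    linked into the calling project.
    """
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

A call would then read run_isolated(marker_command(Path("filing.pdf"), Path("out"))), with the converted Markdown picked up from the output directory afterwards.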
9. MinerU (OpenDataLab) — https://github.com/opendatalab/MinerU
- License: AGPL-3.0 (run-alongside is fine per Pär's brief)
- Active: Very. Highest GitHub stars in the category.
- Lift: Strong layout detection; explicit JSON output for agentic pipelines.
- Capability: C2
- Concerns: Tuned for CJK content first — header/footer cleanup logic may be over-eager on Swedish docs. AGPL forces network-service isolation. Heavier than Marker; probably not the right default for one journalist.
10. olmOCR (AllenAI) — https://github.com/allenai/olmocr
- License: Apache 2.0 (toolkit); model weights also openly licensed
- Active: Very. olmOCR 2 (Oct 2025), v0.4.x ongoing.
- Lift: End-to-end VLM that handles scanned/messy PDFs, handwriting, and equations in one pass. State of the art for English print; accuracy on Swedish text would need checking. Scanned Bolagsverket/Lantmäteriet documents are exactly the use case.
- Capability: C2
- Concerns: VLM inference cost — needs a GPU or remote call. Overkill for digital-native PDFs. Best as the OCR fallback when Docling/Marker fail.
11. Apache Tika — https://tika.apache.org/
- License: Apache 2.0
- Active: Yes. 3.3.1 stable (Mar 2026), 4.0 alpha out.
- Lift: The Swiss army knife — 1000+ formats, including the weird ones (Office, RTF, archives, emails, .msg). For "what is this file, and give me text + metadata," it's still the floor.
- Capability: C2
- Concerns: Java service. Text extraction only — no layout, no tables-as-structure. Use it as a format-detection + cheap-text first pass, not as the structured-extraction layer.
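Used as a format-detection and cheap-text first pass, a locally running Tika server can be called over HTTP with nothing but the standard library. The sketch assumes tika-server is listening on its default port 9998 on localhost; the helper names are hypothetical, and the endpoints used are the server's /detect/stream and /tika resources.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # assumption: local tika-server, default port


def tika_request(endpoint: str, payload: bytes, accept: str = "text/plain") -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to a Tika endpoint."""
    return urllib.request.Request(
        url=f"{TIKA_URL}/{endpoint}",
        data=payload,
        method="PUT",
        headers={"Accept": accept},
    )


def detect_mime(payload: bytes) -> str:
    """Ask Tika what the file is, e.g. before picking a parser."""
    with urllib.request.urlopen(tika_request("detect/stream", payload), timeout=30) as resp:
        return resp.read().decode("utf-8").strip()


def extract_text(payload: bytes) -> str:
    """Cheap plain-text extraction with no layout or table structure."""
    with urllib.request.urlopen(tika_request("tika", payload), timeout=120) as resp:
        return resp.read().decode("utf-8")
```

Keeping Tika behind two small functions also keeps the Java service replaceable if format sniffing later moves elsewhere.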
12. GROBID — https://github.com/kermitt2/grobid
- License: Apache 2.0
- Active: Yes (maintainer transitioned from Patrice Lopez to Luca Foppiano; Inria-backed). Used by Semantic Scholar, ResearchGate, CERN.
- Lift: Only worth it if Pär's corpus includes academic papers (mining-history theses?). Otherwise foreign-assumption-heavy.
- Capability: C2 — narrow
- Concerns: Scholarly-paper-shaped. Bolagsverket filings and Lantmäteriet maps look nothing like an arXiv preprint. Skip unless academic docs show up.
13. unstructured.io — https://github.com/Unstructured-IO/unstructured
- License: Apache 2.0 (library); paid Platform on top
- Active: Yes, but increasingly steering users to the hosted Platform; OSS partitioners are sometimes thinner than the SaaS equivalents.
- Lift: 60+ format partitioners as a polished Python API. Convenient for prototyping the ingest pipeline before specializing per doc-type.
- Capability: C2
- Concerns: Open-core tension — best parsers may live behind the paywall over time. Output quality on complex layouts trails Docling.
Category 3 — Document Viewers (for C3)
14. PDF.js (Mozilla) — https://github.com/mozilla/pdf.js
- License: Apache 2.0
- Active: Very.
- Lift: The default. Renders in browser, handles annotations for viewing. Pairs with react-pdf-highlighter (MIT) for selection/highlight overlays.
- Capability: C3 and C2 (the doc pane in side-by-side review)
- Concerns: Can render existing highlights but persistence/save is limited; you store annotations in your own DB keyed by page+coords, not in the PDF. For C3 this is fine.
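Storing annotations outside the PDF, keyed by page and coordinates, needs very little. The following SQLite sketch is one possible shape; the table name, column names, and the idea of tying a highlight to an extracted field are assumptions, and the coordinate system (PDF points versus viewport-relative values) is left to whatever the viewer emits.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS highlights (
    id     INTEGER PRIMARY KEY,
    doc_id TEXT    NOT NULL,   -- e.g. a hash of the original PDF
    page   INTEGER NOT NULL,   -- 1-based page number
    x0 REAL NOT NULL, y0 REAL NOT NULL,
    x1 REAL NOT NULL, y1 REAL NOT NULL,
    field  TEXT,               -- extracted field this highlight supports
    note   TEXT
);
CREATE INDEX IF NOT EXISTS idx_highlights_doc_page ON highlights (doc_id, page);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the annotation store and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_highlight(conn, doc_id, page, rect, field=None, note=None) -> int:
    """Insert one highlight rectangle and return its row id."""
    x0, y0, x1, y1 = rect
    cur = conn.execute(
        "INSERT INTO highlights (doc_id, page, x0, y0, x1, y1, field, note) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (doc_id, page, x0, y0, x1, y1, field, note),
    )
    conn.commit()
    return cur.lastrowid


def highlights_for_page(conn, doc_id, page) -> list[dict]:
    """Fetch the highlights the viewer should draw on one page."""
    rows = conn.execute(
        "SELECT x0, y0, x1, y1, field, note FROM highlights "
        "WHERE doc_id = ? AND page = ? ORDER BY id",
        (doc_id, page),
    ).fetchall()
    return [{"rect": row[:4], "field": row[4], "note": row[5]} for row in rows]
```

Because the PDF itself is never modified, the same rows can serve C2 (evidence for a field under review) and C3 (backlinks into the original).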
15. react-pdf-highlighter — https://github.com/agentcooper/react-pdf-highlighter
- License: MIT
- Active: Yes, v8 rc as of 2025–26.
- Lift: Drop-in React component for selection → highlight → comment, built on PDF.js. This is the closest thing to the side-by-side C2 affordance you can lift without writing it yourself.
- Capability: C2 (primary), C3 (highlights = backlinks)
- Concerns: React-only. Solo-maintainer project — read the code before depending heavily.
16. Datashare (ICIJ) — https://github.com/ICIJ/datashare
- License: AGPL-3.0
- Active: Yes — significant 2025 redesign.
- Lift: Tika + Tesseract + Elasticsearch + Vue 3 viewer, battle-tested on Panama-Papers-scale corpora. The doc viewer + faceted browse is closer to C3's intent than anything else in this scan. Their datashare-preview subproject is a standalone document-preview server.
- Capability: C3 (primary), C2 partly
- Concerns: AGPL — run as service, don't import. Elasticsearch dependency is real ops weight. Search-first product, not browse-first; the "N docs with one-line explainers" view would need building on top.
17. OpenAleph / Aleph (DARC, formerly OCCRP) — https://github.com/openaleph/openaleph
- License: MIT (Aleph historically; OpenAleph fork inherits)
- Active: Yes — DARC actively developing post-fork (2024–26). OCCRP's own variant went commercial ("Aleph Pro"); OpenAleph is the community continuation.
- Lift: Investigative-journalism-shaped from day one. Entity extraction and cross-doc linking are core features, and "all docs about company X" is literally a built-in view (FollowTheMoney schema). This is the closest match to C3's brief in the whole scan.
- Capability: C3 (very strong), C2 (weak — no review-UI affordance)
- Concerns: Heavy stack (Postgres + Elasticsearch + Redis + workers + Aleph-specific FtM ontology). The FollowTheMoney schema is opinionated: adopting it means bending Pär's data model to theirs. But for the C3 capability alone, this is the strongest reference architecture.
18. Calibre-Web — https://github.com/janeczku/calibre-web
- License: GPL-3.0
- Active: Yes (17k+ stars). Also see the Calibre-Web-Automated fork.
- Lift: "Show me my N documents with summaries and metadata" browse UI for a personal library. Series, tags, and custom columns map onto Pär's "all Bolagsverket docs for company X."
- Capability: C3
- Concerns: Ebook-shaped (EPUB/MOBI assumptions, ISBN/author/series metadata). Bending it to "company filings" means fighting the schema. GPL contagion if lifting code; fine as inspiration for the browse-grid layout.
Honest summary
- For C2: Don't deploy any of these. Lift the Label Studio template DSL idea + Argilla SDK-first data model + react-pdf-highlighter for the doc pane. Build the actual UI custom; it's two screens.
- For C2 parsing: Docling first, Marker as fallback, olmOCR for scans, Tika as the format-sniffer floor. Skip GROBID unless academic docs appear.
- For C3: OpenAleph is the reference architecture even if not deployed. Read their FtM data model and Vue viewer. Lift the layout, build your own.
- Skeptic note: Every tool here was built for a different user (ML teams, scholarly NLP, big-corpus investigative units of 50+). One Swedish journalist's tool is closer to a bespoke Svelte app reading a few Docling JSONs than to any of the above. The lift is patterns and a few components, not platforms.
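The C2 parsing recommendation above amounts to an ordered fallback chain. A minimal sketch follows, assuming each parser is wrapped as a callable that takes a file path and returns text; the parser keys, the choice to try olmOCR first for scans, and the use of Tika as the last resort are assumptions layered on the summary, not tested behavior.

```python
from typing import Callable

Parser = Callable[[str], str]  # takes a file path, returns extracted text


def parser_order(is_scan: bool) -> list[str]:
    """Order in which to try parsers (assumed ordering, see lead-in)."""
    if is_scan:
        return ["olmocr", "docling", "marker", "tika"]
    return ["docling", "marker", "olmocr", "tika"]


def parse_with_fallback(
    path: str, is_scan: bool, parsers: dict[str, Parser]
) -> tuple[str, str]:
    """Return (parser_name, text) from the first parser that yields text.

    Parsers that raise or return only whitespace are skipped; their
    failures are collected so the final error explains what was tried.
    """
    errors: dict[str, str] = {}
    for name in parser_order(is_scan):
        parser = parsers.get(name)
        if parser is None:
            continue  # parser not installed or configured
        try:
            text = parser(path)
        except Exception as exc:  # any parser failure triggers fallback
            errors[name] = repr(exc)
            continue
        if text.strip():
            return name, text
        errors[name] = "empty output"
    raise RuntimeError(f"all parsers failed for {path}: {errors}")
```

Returning the parser name alongside the text lets the review UI show Pär which tool produced a given extraction, which helps when a doc type keeps falling through to the floor.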