PärPod Temp
OSS Scan — Document Ingest, OCR, Viewer, Human-in-the-Loop Review UI
Episode 315m · May 28, 2026
Pär, a Swedish journalist at Årebladet, needs to validate Claude's document extractions before loading them. Combining Label Studio's template DSL, Argilla's SDK model, and react-pdf-highlighter beats any off-the-shelf platform: build custom instead of fighting ML-team mental models.

Target: an investigative substrate for one Swedish journalist (Årebladet), for internal use only and never deployed publicly. Lift-friendly licenses (MIT, BSD, Apache) are preferred; AGPL is acceptable when the tool runs alongside as a separate service. C2 = per-doc-type review UI for ingest. C3 = scoped, curated browse with one-line explainers.
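
The license rule in the target statement can be written down as a small lookup, which is handy when triaging the candidates below. This is only a sketch: the `Usage` enum, the `classify_license` helper, and the choice of SPDX identifiers are illustrative assumptions, and the fallback to a manual check for any license the target statement does not name is also an assumption.

```python
from enum import Enum


class Usage(Enum):
    """How a candidate project may be used (hypothetical labels)."""
    LIFT = "lift code into the project"
    RUN_ALONGSIDE = "run as a separate service only"
    REVIEW = "needs a manual license check"


# SPDX identifiers grouped per the target statement (MIT/BSD/Apache vs. AGPL).
LIFT_FRIENDLY = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}
RUN_ALONGSIDE_ONLY = {"AGPL-3.0-only", "AGPL-3.0-or-later"}


def classify_license(spdx_id: str) -> Usage:
    """Map an SPDX license identifier to the allowed kind of use."""
    normalized = spdx_id.strip()
    if normalized in LIFT_FRIENDLY:
        return Usage.LIFT
    if normalized in RUN_ALONGSIDE_ONLY:
        return Usage.RUN_ALONGSIDE
    # Assumption: anything the target statement does not name gets a human look.
    return Usage.REVIEW
```

Calling `classify_license("Apache-2.0")` yields `Usage.LIFT`, while `classify_license("AGPL-3.0-only")` yields `Usage.RUN_ALONGSIDE`. The actual license of each project listed below should be read from its repository rather than assumed.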

Category 1 — Annotation / Human-in-the-Loop Review UI (for C2)

1. Label Studio — https://github.com/HumanSignal/label-studio

2. Doccano — https://github.com/doccano/doccano

3. Argilla — https://github.com/argilla-io/argilla

4. INCEpTION (UKP / TU Darmstadt) — https://inception-project.github.io/

5. Kiln AI — https://github.com/Kiln-AI/Kiln

6. Refinery (Kern AI) — https://github.com/code-kern-ai/refinery

Category 2 — Document Parsing / Structured Extraction (for C2)

7. Docling (IBM Research) — https://github.com/docling-project/docling

8. Marker — https://github.com/VikParuchuri/marker

9. MinerU (OpenDataLab) — https://github.com/opendatalab/MinerU

10. olmOCR (AllenAI) — https://github.com/allenai/olmocr

11. Apache Tika — https://tika.apache.org/

12. GROBID — https://github.com/kermitt2/grobid

13. unstructured.io — https://github.com/Unstructured-IO/unstructured

Category 3 — Document Viewers (for C3)

14. PDF.js (Mozilla) — https://github.com/mozilla/pdf.js

15. react-pdf-highlighter — https://github.com/agentcooper/react-pdf-highlighter

16. Datashare (ICIJ) — https://github.com/ICIJ/datashare

17. OpenAleph / Aleph (DARC, formerly OCCRP) — https://github.com/openaleph/openaleph

18. Calibre-Web — https://github.com/janeczku/calibre-web


Honest summary