A blank repository, ratified as the approach on 2026-05-26, will selectively borrow function-level logic from the frozen gruvor codebase rather than inherit its architecture: each function borrowed from gruvor is rewritten from its requirement, and code is lifted directly from permissively licensed open source where that is useful.
02 — Build approach: blank repo with selective borrow
Date: 2026-05-26
Status: draft, first pass. Depends on 01-framing.md being ratified (it is, 2026-05-26 23:22).
Inputs: 01-framing.md (C1–C6 + hard constraints), Pars_view.md Q11–Q12, omega-1 §4 (what was valuable in gruvor), omega-2 §4 (the six forks), omega-2 §5 (operational micro-forks).
Purpose: specify what "blank repo with selective borrow from gruvor" means, why it is the chosen approach over the alternatives, and what choices remain open for Section 03 (function-level keep/borrow/drop) and Section 04 (MVP scope).
Pär ratified blank+borrow on 2026-05-26 23:22 over sliced-gruvor, OpenAleph-as-base, and Datashare-anchored. This document specifies that choice concretely. It does not re-litigate it.
1. The decision in one paragraph
The next substrate starts as a fresh repository, sibling to arebladet2, that borrows from gruvor at the function level — not the file or implementation level. Gruvor remains frozen in place until the new substrate is up; nothing is dragged forward by inertia. Direct code lift from permissive-license open source is explicitly fine — this is an internal tool, attribution discipline applies, but reproducing well-solved problems from scratch is not the goal. OpenAleph, Datashare, gitscrape, and similar are evaluated as candidate sources to lift from or run alongside this approach, not as alternative bases. Each function carried over from gruvor specifically is rewritten from its requirement, with gruvor's implementation read as one reference among several.
2. Why this and not the alternatives
Why not sliced-down gruvor (omega-2 Fork F or similar)
The omega-2 forks A/F treat "keep the spine, fix the entry points" as the lower-risk path. The reasons it does not get picked here:
- AI sunk-cost bias is the specifically named LLM failure mode in Pars_view.md ("AI's tend to avoid dumping its own code and starting over likely due to how it reads context, so it as an option needs to be reinforced even if we do not go that way"). Starting from gruvor reproduces exactly the pull this bias creates.
- Gruvor's architecture was largely right (omega-1 §1 lead, omega-2 §1). What was wrong was what was built on top of it. Sliced-down gruvor preserves the architecture but inherits the same tendency to grow Claude-shaped surfaces onto it (M2). Blank repo forces every surface to clear C1–C6 before it lands.
- Gruvor carries 30 hand-coded loaders, a 4185-line CLI, scheduler dark code, README/alias drift, and a broken success/empty-failure contract (omega-2 §2). Sliced-down gruvor's "fix list" is itself non-trivial. Blank repo trades that fix work for fresh-write work, and fresh-write work compounds toward the substrate's design intent.
- Pär's criterion 4 ("better to spend time now to build the infrastructure than to do small fixes"). Sliced-down gruvor is the small-fixes path under a larger frame.
Why not OpenAleph-as-base
OpenAleph (DARC, MIT-licensed, the open-source successor to alephdata/aleph) is a real candidate, named by Pär in Q7. Reasons it is not the base — stated honestly, since this is the rejection Codex flagged as the weakest in an earlier draft:
- Foreign maintenance surface (load-bearing reason). Adopting OpenAleph as base means Pär+Claude carry a large upstream codebase whose architecture decisions are not ours to make and whose update cadence is not ours to set. For a tool whose value is its fit to Pär's workflow over years, that tax is real and ongoing. This is the actual decisive reason.
- Optionality cost. Base-choice locks in a vocabulary (Aleph's entity model, FTM schema vocabulary, the investigation-as-collection metaphor). The substrate's most consequential capability (C4 — cross-investigation connected database) wants a vocabulary we control. Building on OpenAleph means every C4 design decision is mediated through Aleph's existing concepts. Not a claim that Aleph is provably incompatible with C4 — no evidence supports that; Aleph's FTM is in fact strong on cross-entity reference. The claim is the weaker one: starting from our own data model is cheaper than translating from theirs.
- Swedish NER gap. OpenAleph and Datashare share NER tooling that doesn't cover Swedish (omega-1 §5 gotcha 2). A sidecar (spaCy sv_core_news_lg or KB-BERT) is needed for either. That cost is borne wherever the document layer is; it is neutral on base choice but worth flagging.
- OpenAleph as source to lift from — explicitly open. The above rule out OpenAleph as the base. They do not rule out lifting code directly from it (MIT-licensed): their entity model, FollowTheMoney schema implementation, document-ingest patterns, NER pipeline scaffolding. Caveat (per Codex): lifting these pieces silently imports the very architectural assumptions the "optionality cost" point flags. Section 03 must check each candidate lift against C1/C4 explicitly, not adopt OpenAleph code under the cover of "borrowing functions."
Why not Datashare-anchored
The omega-1 §5 sweet-spot hypothesis was "Datashare for documents + thin Swedish-pull layer + parmaps for maps." The omega-2 C1 standup found Datashare ingested 225 X92 docs in seconds with full-text search + in-browser PDF + entity browse. Real strengths. Reasons it is not the anchor:
- C1 (timeline-of-all-things with area filter) is the most consequential capability in the framing. Datashare's date-faceted document filter is not equivalent to a timeline-of-events model (omega-2 §4 Fork B KG-freeze cost). Anchoring on Datashare buys document ergonomics at the cost of the substrate's load-bearing capability.
- C4 (cross-investigation discovery) requires entity-level linking that Datashare does not natively do. Datashare projects are isolated by design — you can search across them, but the connected-database shape is not native.
- Tagging vs typed structure — Datashare's primary affordance for "this is a thing" is tags. Tags are fine for many uses; they are not a substitute for typed entities with relationships, which C4 will need.
- Datashare as component: open, with a real license boundary. Datashare can sit alongside the new substrate as the document-corpus surface: in-browser PDF viewing, full-text search with snippets, OCR backfill. Running it as a separate unmodified service we call via API is clean. Lifting Datashare code into our repo, or modifying Datashare and exposing the result over our UI, can trigger AGPL §13 source-offer obligations on the combined work, including for internal network deployments: §13 applies whenever users interact with a modified version remotely over a network, regardless of who those users are. This is not academic; Section 03 has to decide run-alongside vs lift-pieces with that constraint in view. What Datashare cannot do is be the substrate.
Why blank+borrow specifically
- Forces explicit design. Every entity, every surface, every loader has to earn its place against C1–C6 + hard constraints. Nothing carries forward by inertia.
- Honors Pär's criterion 7 ("no code is sacred") concretely. The frame is what does the next investigation need, not what already exists.
- Risk-managed by borrow. Function-level borrowing from gruvor (Bolagsverket pulls, the timeline model, the doctor pattern) avoids the dumbest cost of blank-slate work — re-deriving the things gruvor already solved. The borrowing is at the level of "this function, with its requirement re-stated, with its implementation read from gruvor as one reference," not "copy this file."
3. What "blank" means concretely
- New repository. Name and location decided in §5 below. Not a branch of gruvor. Not a subdirectory under arebladet2. Sibling repo.
- No dependencies on gruvor's filesystem layout, schema, or module structure. The new repo can read gruvor's data exports as inputs to its own bootstrapping (parquet snapshots, raw API pulls, the dossier corpus), but it does not import gruvor as a Python package or symlink gruvor directories.
- No carrying over of CLI shape. The 4185-line cli.py is not a starting point; it is a finding about what shape to avoid (M2).
- No carrying over of UI routes. Routes that worked in gruvor (/kb-timeline, /permits, /search, gruvor verify fact-pack) are evidence about which capabilities matter, not URL paths to reproduce.
- No graveyard of partial features. The 25 unused dossier folders, the dark scheduler, the half-built /companies/{orgnr} expansion, /people, /media, /geo, /dashboard, /inbox, /dropins, /story, /factpack: none of these appear in the new repo as scaffolding.
- A clean dependency graph. The new repo is allowed to grow only in directions that pay rent against C1–C6.
What blank does not mean:
- It does not mean "ignore what we learned." The whole point of the three omega-reviews is to serve as input to this section.
- It does not mean "fresh from a tutorial." The substrate is mature in intent from day 1; only the code is new.
- It does not mean "redo every primitive from scratch." Bolagsverket pulls are a solved problem; the borrow path inherits the function, not the file.
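The "does not import gruvor as a Python package" rule in the list above is mechanical enough to check automatically. A minimal sketch, assuming the new repo's code is Python and that gruvor would be importable under the module name gruvor; the function name is hypothetical, and symlink detection is deliberately left out.

```python
import ast

FORBIDDEN_PACKAGE = "gruvor"  # assumed module name of the frozen codebase


def forbidden_imports(source: str, forbidden: str = FORBIDDEN_PACKAGE) -> list[str]:
    """Return every absolute import in `source` that targets the forbidden package."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the new repo, so skip them.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not look-alike names.
            if name == forbidden or name.startswith(forbidden + "."):
                found.append(name)
    return found
```

Run over every source file in the repo, a non-empty result would fail the check. Reading gruvor's data exports is unaffected, since that goes through files rather than imports.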
4. What "borrow" means concretely
Borrowing operates at the function level (not the file level) for gruvor, and at the code level (not just integration level) for permissive-license open source. The substrate is an internal tool; license discipline is real, but the framing is "use what is useful, attribute when required, do not reproduce gruvor's shape by inertia."
Four borrow tiers. The full keep/borrow/drop table is Section 03's job — this section commits to the tier shape only.
Tier 1 — Borrow the function from gruvor, rewrite the implementation. Things gruvor solved that the substrate needs: Bolagsverket-ingest, SGU/Bergsstaten pulls, the timeline model (capability, not data structure), spatial joins on EPSG:3006, PII-redaction discipline, the press-question sequencing methodology. The new substrate writes these from their requirement; gruvor's code is read as a reference, not pasted. Why rewrite-not-paste for gruvor specifically: gruvor's code carries the shape that omega-1/omega-2 flagged (4185-line CLI, route assumptions, scheduler patterns); the requirements are clean, the implementations are not.
Tier 1B — Lift code directly from permissive-license open source. OpenAleph (MIT), gitscrape, FollowTheMoney schema (MIT), spaCy sv_core_news_lg model, KB-BERT, smaller utility libraries. Where a permissive-license (MIT/BSD/Apache) project has a clean implementation of something the substrate needs, lifting code is explicitly fine. Internal tool, no redistribution concerns for permissive licenses. Discipline: attribute in source where the license requires it, keep a THIRD_PARTY.md log of where things came from, and re-evaluate license compatibility if the substrate ever graduates to a public deploy.
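The THIRD_PARTY.md log can be kept uniform by rendering each entry from a small record. A minimal sketch: the field names, the entry layout, and the example values in the usage note are all assumptions for illustration, not a format this document commits to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lift:
    """One piece of code lifted from an upstream project (fields are illustrative)."""
    component: str     # what was lifted
    upstream: str      # project it came from
    license: str       # license as declared by the upstream
    upstream_ref: str  # commit or tag the code was copied at
    local_path: str    # where the lifted code lives in the new repo


def third_party_entry(lift: Lift) -> str:
    """Render one lift as a block of text for THIRD_PARTY.md."""
    return "\n".join([
        f"## {lift.component}",
        f"- Upstream: {lift.upstream}",
        f"- License: {lift.license}",
        f"- Lifted at: {lift.upstream_ref}",
        f"- Local path: {lift.local_path}",
    ])
```

For example, third_party_entry(Lift("Entity schema helpers", "OpenAleph", "MIT", "<commit>", "<path>")) yields a five-line entry. Recording the upstream ref is what makes a later license re-evaluation or upstream diff practical.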
Datashare is not in Tier 1B. Datashare is AGPL-licensed and gets handled separately. The clean boundary is running Datashare as a separate unmodified service (we issue API calls to it; nothing lifted into our repo). Lifting Datashare code into the substrate, or modifying it and exposing the combined result over a network (including internally over our own UI), can trigger AGPL §13 source-offer obligations on the combined work. "Internal tool" does not exempt this; §13 applies to any users interacting with a modified version remotely over a network, regardless of public-deploy status. If Section 03 wants Datashare's document-viewer or full-text-search UI, the default path is "embed an iframe pointing at the unmodified Datashare service," not "lift its viewer code into our repo." Modifying Datashare or lifting any of its source files needs a deliberate decision, not a shrug.
Architecture-by-borrow drift — explicit risk. Tier 1B can import an entity model, document model, ingest pipeline, or schema piecemeal along with its architectural assumptions. The assumptions are not visible at the function level but compound at the C1/C4 level. Section 03's per-candidate check has to evaluate each lift against the framing's capabilities, not just against "does this function work."
Tier 2 — Migrate the data from gruvor, drop the code. Things where gruvor's output is durable but its production code is not worth carrying: the X92 dossier corpus, the 27 dossier folders' content, the parquet snapshots, the API credentials, BankID-pulled documents. The new substrate consumes these as inputs to its own bootstrapping.
Tier 3 — Drop entirely. Things that didn't earn their build cost (omega-1 §4 list): /companies/{orgnr} expansion, /graph route shape, scheduler/LaunchAgent/doctor stack as currently implemented, write surfaces (/inbox, /dropins, /story), scaffold routes, the 4185-line CLI. Nothing borrowed; not even as reference.
Tier 1B is the change from "blank repo with selective borrow from gruvor" to "blank repo with selective borrow, including direct code lift from permissive OSS." It is the path that closes the gap on document-corpus features (where Datashare and OpenAleph are mature) and Swedish NER (where KB-BERT exists). Section 03 names specific Tier 1B candidates per capability.
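The tier shape and its two guard rails (license eligibility for Tier 1B, and the architecture-by-borrow check) can be written down as data plus one validation function. A minimal sketch: the class names, the brings_entity_model flag, and the simplified license labels are assumptions for illustration; Section 03's actual table may look different.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    REWRITE_FROM_GRUVOR = "1"   # borrow the function, rewrite the implementation
    LIFT_PERMISSIVE_OSS = "1B"  # lift code from permissive-license open source
    MIGRATE_DATA = "2"          # keep gruvor's data, drop its code
    DROP = "3"                  # nothing borrowed, not even as reference


# Simplified labels for the permissive families named in the text above.
PERMISSIVE = {"MIT", "BSD", "Apache"}


@dataclass(frozen=True)
class Candidate:
    name: str
    tier: Tier
    license: Optional[str] = None      # only meaningful for Tier 1B
    brings_entity_model: bool = False  # the lift carries its own entity model


def problems(candidate: Candidate) -> list[str]:
    """Return the reasons a candidate cannot be accepted as classified."""
    issues = []
    if candidate.tier is Tier.LIFT_PERMISSIVE_OSS:
        if candidate.license not in PERMISSIVE:
            issues.append(
                f"{candidate.name}: license {candidate.license!r} is not permissive, "
                "so it is not eligible for Tier 1B"
            )
        if candidate.brings_entity_model:
            issues.append(
                f"{candidate.name}: brings its own entity model, "
                "so it needs a model decision checked against C1/C4"
            )
    return issues
```

An AGPL-licensed candidate placed in Tier 1B is rejected by the first rule, which is the Datashare boundary expressed as a check. The second rule does not reject a lift outright; it flags that the decision is about the data model rather than about the code.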
5. Repo location, name, and shape
Location
~/ai/tools/<name>/ — under tools (it is a tool, with code), not projects (which are research/explorations per ~/ai/projects/CLAUDE.md). Sibling to arebladet2/. Specifically not inside arebladet2/ per the framing's "modular for arebladet2, not built into arebladet2 yet" hard constraint.
Name (open question)
Candidate constraints:
- Short. Two syllables preferable.
- Not "gruvor 2" or any -v2 / -next suffix (reproduces the AI sunk-cost framing).
- Not mining-specific (the platform must be reusable for non-mining work).
- Names something the substrate does, not something it is.
This document does not pick a name. Pär picks the name; a few candidates to react to:
- spelt: a grain (small reference to digging/foundation), short, neutral.
- gnejs: Swedish for gneiss, the bedrock under the Bergslagen mining district; investigations dig into bedrock. Mining-resonant but not mining-bound.
- forskning: Swedish for research. Direct, slightly generic.
- groop: Swedish dialect for a small mine pit. Mining-resonant.
- kall: neutral, place-name-feeling, also resonates with the Kall power station project. Probably too overloaded.
- fanta: short for fantastisk / fantasi, no mining bias.
None of these is a recommendation. The naming choice is the kind of thing that benefits from one clean decision rather than from comparison; Pär picks or vetoes and proposes.
Shape (open, listed only to flag the questions Section 04 will close)
This subsection lists open shape questions, not decisions. Each is Section 04's to commit on. Listing them here only to keep Section 03 (component selection) from inheriting them as quiet assumptions:
- Primary language and project layout. Likely Python at the top given gruvor's Python heritage and Pär's uv standard, but not committed; if a lifted component pulls in a different runtime (Datashare is JVM), the right answer may be polyglot.
- One repo vs. coordinated repos. Open. Running Datashare alongside is a coordinated-systems shape regardless; whether our own code is one project or several is Section 04's call.
- Surface split. A small CLI exists for other tools to call (per hard constraint — CLI for other tools = OK). A web UI is Pär's primary surface (per hard constraint — UI must be useful). How those layers connect is design work, not a framing commitment.
- Datashare / OpenAleph posture. Run-alongside, lift-pieces (subject to AGPL constraint above for Datashare; MIT-clean for OpenAleph but with the architectural-assumption caveat in §2), or skip entirely — Section 03 decides per capability.
- Data layer scoping. Per C4, the data layer must support cross-investigation entity identity. Whether that lives in one DB, several, or as a virtual layer over per-investigation stores is a Section 04 decision.
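To make "cross-investigation entity identity" concrete: one option is to derive an entity's ID from its type and a normalized registry key, so the same company gets the same ID in every investigation regardless of where the data is stored. A minimal sketch, not a decision: the namespace string, the function names, and the 10-digit orgnr assumption are hypothetical, and the orgnr in the usage note is made up.

```python
import uuid

# Hypothetical fixed namespace; any UUID works as long as it never changes.
SUBSTRATE_NS = uuid.uuid5(uuid.NAMESPACE_URL, "substrate.example/entities")


def normalise_orgnr(raw: str) -> str:
    """Reduce an organisation number to bare digits (assumes the 10-digit form)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {raw!r}")
    return digits


def entity_id(entity_type: str, key: str) -> str:
    """Derive a stable ID from entity type and registry key, independent of investigation."""
    return str(uuid.uuid5(SUBSTRATE_NS, f"{entity_type}:{key}"))
```

With this, entity_id("company", normalise_orgnr("556677-8899")) and entity_id("company", normalise_orgnr("5566778899")) are equal, so two investigations that meet the same company agree on its identity without coordinating. Entities that have no registry key, such as many persons, are not covered by this approach and remain part of the Section 04 question.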
6. How this serves the C1–C6 capabilities — compatibility check, not design
This is a no-obvious-incompatibility check, not a capability-delivery argument. Demonstrating that nothing in blank+borrow forbids a capability does not protect the load-bearing parts of that capability. The actual delivery is Section 04's (MVP scope) and beyond. Codex flagged this section as "compatibility theatre" in an earlier draft — this rewrite restates it honestly.
- C1 (timeline + area filter). No incompatibility — blank repo can pick any timeline implementation. Unresolved: area-filtered event chronology requires a data model that joins events to typed places-or-areas without forcing point geometry. Section 04 must specify this. Risk if not specified: we re-derive gruvor's point-only filtering by accident.
- C2 (per-doc-type review UI for ingest). No incompatibility, and the approach is positively suited — gruvor's 30 hand-coded loaders are exactly what review-UI ingest replaces.
- C3 (scoped curated doc browse). No incompatibility. Unresolved: whether this surface is built from scratch, Datashare-backed (run-alongside), or OpenAleph-lift-pieces is a Section 03 component decision.
- C4 (cross-investigation connected database). No incompatibility, but this is the capability most at risk of being papered over. The load-bearing requirement is durable entity / document / event identity across investigations, which means a data model that names a "company" or "person" or "permit" the same way whether it shows up in mining work or non-mining work. Section 04 must specify the entity-identity strategy explicitly. Risk if not specified: each investigation grows its own local IDs and C4's payoff (cross-investigation discovery) never materializes.
- C5 (media coverage / press articles as first-class entity). No incompatibility — blank repo can build the press-articles entity into the data model from the start. Section 04 includes this in the entity model.
- C6 (ingest permissive). No incompatibility — blank repo can write an "incoming/unsorted" landing zone that gruvor never had cleanly.
No capability is in obvious tension with blank+borrow. C1 and C4 carry unresolved load that Section 04 must address; flagging here so they don't get inherited as "solved" by Section 03.
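The unresolved C1 point, joining events to areas without forcing point geometry, can be illustrated with events that reference area IDs instead of carrying coordinates. A minimal sketch under that assumption: the Event fields, the area ID strings, and the timeline function are hypothetical, and how an area itself is defined is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    when: date
    title: str
    area_ids: frozenset[str]  # areas the event concerns; no coordinates required


def timeline(events: Iterable[Event], area_id: Optional[str] = None) -> list[Event]:
    """Return events in chronological order, optionally limited to one area."""
    selected = [e for e in events if area_id is None or area_id in e.area_ids]
    return sorted(selected, key=lambda e: e.when)
```

Because the filter works on area membership, an event such as a permit decision covering a whole concession can appear on an area-filtered timeline even though it has no single point location. An event can also belong to several areas at once.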
7. How this serves the hard constraints
- UI must be useful, not point elsewhere. Blank repo is the natural home for this discipline; gruvor failed this at /companies/{orgnr} § the 📓 Dossier card. Section 04 builds in a "no path-rendering as UI feature" check.
- No mining-specific assumptions in the platform layer. Blank repo enforces this by construction. Sliced-gruvor would have to back this out.
- Bolagsverket-ingest as first-class reusable function. Blank repo writes this as a cleanly callable function from day 1, not as a CLI subverb buried in 4185 lines.
- Public spatial/permits data goes to gruvkartor.se. Blank repo can simply not implement that surface. Sliced-gruvor would have to remove it.
- No CLI as primary Pär surface. Blank repo writes no Pär-facing CLI. Period.
- Modular for arebladet2. Blank repo sits as a sibling with a clean export interface from day 1.
- Factual interpretations: compute or cite, not characterise. Blank repo builds this into the AI-output pipeline (no shortcuts at story-walkthrough time). Section 04 specifies where the guard lives.
- /cowork as strong default for build sessions. Blank repo's first commits are built under this discipline.
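"Bolagsverket-ingest as a cleanly callable function" can be sketched as a plain function whose result separates "the source had nothing" from "the pull failed", the contract omega-2 found broken in gruvor. A minimal sketch: fetch is a stand-in for whatever client actually talks to Bolagsverket, whose API is not described here, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    OK = "ok"          # the source returned records
    EMPTY = "empty"    # the source answered and had nothing
    FAILED = "failed"  # the pull itself failed


@dataclass(frozen=True)
class PullResult:
    outcome: Outcome
    records: tuple = ()
    error: Optional[str] = None


def pull_company(orgnr: str, fetch: Callable[[str], list]) -> PullResult:
    """Pull records for one company; never report a failure as an empty success."""
    try:
        records = fetch(orgnr)
    except Exception as exc:  # any client error is surfaced, not swallowed
        return PullResult(Outcome.FAILED, error=f"{type(exc).__name__}: {exc}")
    if not records:
        return PullResult(Outcome.EMPTY)
    return PullResult(Outcome.OK, records=tuple(records))
```

Injecting fetch keeps the function callable from a CLI, a review UI, or another tool without any of them knowing about the others, and it makes the contract testable without network access.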
8. Risks of this approach (named, not hand-waved)
Blank+borrow is not free. The honest cost list:
- Re-implementation cost on Tier 1 functions. Bolagsverket ingest, SGU pulls, EPSG:3006 spatial joins are not trivial. Borrowing the function and rewriting the implementation means we pay the rewrite cost. Mitigated by: gruvor's code remains readable as reference, and the rewrite has a clearer target than gruvor's original write did.
- Architecture-by-borrow drift (Tier 1B specific). Lifting code from OpenAleph or another permissive-licensed project silently imports the architectural assumptions of the upstream: entity models, ingest contracts, ID schemes, naming conventions. Function-level adequacy is not capability-level adequacy. The risk is that the substrate accumulates a chimera of upstream models that don't compose into the C4 cross-investigation database we actually need. Mitigated by: Section 03's per-lift evaluation explicitly checks each candidate against C1/C4 before accepting; the THIRD_PARTY.md log makes drift visible; any lift that brings its own entity model is treated as a model-decision, not a code-decision.
- Upstream-update friction on lifted code. Once we lift code from OpenAleph or similar, we own that fork. Upstream bug fixes, security patches, and improvements do not flow in automatically. Mitigated by: prefer "depend on as library" over "lift" where the library shape is clean enough; lift sparingly; document the lift point so the cost is visible.
- AI sunk-cost reflex in the other direction. Once the new substrate exists, the same bias will pull future-Claude away from looking at gruvor's code as reference. Mitigated by: this document explicitly names gruvor as reference material.
- Re-introducing Claude-shaped surfaces (M2). Blank repo doesn't immunise against the same failure mode that produced the abandoned KG + /graph + scaffold routes. Mitigated by: every surface clears C1–C6 before landing; /cowork discipline for non-trivial sessions; review-UI as the centre of ingest rather than ad-hoc loader sprawl.
- Capacity-filling (M3). A blank repo with plentiful Claude quota is exactly the recipe for premature scaffolding. Mitigated by: MVP scope (Section 04) sized to support starting Aura/Häggån, not to ship Aura/Häggån. Guarding against capacity-filling is also what the section ordering of this planning sequence is for: front-loaded discussion is the M3 antidote.
- Time-to-first-investigation cost. Blank+borrow takes longer than sliced-gruvor to reach "I can open the substrate and start ingesting Aura docs." Mitigated by: gruvor is frozen, not deleted; if Aura/Häggån research has to start before the substrate is ready, gruvor handles it interim.
These are the risks. They are tractable. They are not reasons to reverse the decision.
9. What this section deliberately does not commit to
- Repo name. Pär picks.
- Specific tier-1/tier-2/tier-3 function lists. Section 03's job.
- Specific MVP feature set. Section 04's job.
- Specific data model for entities, events, documents. Section 04's job at the earliest, possibly later.
- Whether Datashare, OpenAleph patterns, or gitscrape become components. Section 03 evaluates each as a candidate component.
- Whether the new substrate has a UI server, a static site, or both. Section 04.
- Deploy target. Deferred per framing §6.
- First commit date or session sequence. Pär controls cadence; Aura/Häggån's actual launch shape is the constraint that sets timing.
10. Acceptance signal for this section
This section is good enough to anchor Section 03 if Pär can answer yes to:
- The approach is "blank repo, sibling to arebladet2, borrow from gruvor at the function level."
- The three rejected alternatives (sliced-gruvor, OpenAleph-base, Datashare-anchor) are correctly rejected for the reasons given.
- The four borrow tiers (rewrite-the-function / lift-from-permissive-OSS / migrate-the-data / drop-entirely) are the right shape.
- The risk list is honest and the mitigations are real.
- Nothing in §9 (deliberately not committed to) should actually be in §1–§8.
If any answer is no, this section is revised before Section 03 starts.