Before The First Commit: Twelve Open Questions Standing Between Gruvor And What Comes Next

The Empty Spaces In A Finished Plan

The planning document looks finished. It picks a goal. It picks an approach. It rejects three alternatives in detail. It lists six load-bearing capabilities and seven hard constraints. The last section is called acceptance signal and reads like a contract.

And then there is section nine. Section nine is titled, in both planning documents, what this section deliberately does not commit to. It is the second column. It is the list of things the plan refuses to plan.

This episode is a tour of that second column. Twelve open questions. Six need to be answered before the next planning document can be written. Six are bigger. They shape the substrate over years, not days. The first six are the doors Pär has to walk through this week. The second six are the doors that have to open before the first investigation ships.

Each question is its own arc. The background is where the question comes from. The ideas are what is already on the table. The exploration is how the planning has wrestled with it. And the question itself is the spine that has not yet been straightened.

We begin with the smallest of them. The one that blocks no function but blocks every commit.

Part One. What To Call It

Pär's next investigative substrate begins not with a function, not with a schema, not with a route. It begins with a name. The planning document refuses to choose one. It lists six candidates and walks away.

Spelt is a grain, small and quiet, a reference to digging and foundation. Gnejs is the Swedish word for gneiss, the bedrock under the Bergslagen mining district. Forskning means research, direct but slightly generic. Groop is Swedish dialect for a small mine pit. Kall feels neutral but pulls on the Kall power station project, possibly too overloaded. Fanta is short for fantastisk or fantasi, with no mining bias at all.

None of these is a recommendation. The constraint list is sharp. Short. Not gruvor with a number after it. Not mining-specific, because the platform has to be reusable for non-mining work. Names something the substrate does, not something it is. The naming choice is the kind of thing that benefits from one clean decision rather than from comparison.

So the first question, the one that blocks no function but blocks the first git commit, is the simplest one. What does this thing get called?

The Spine: Borrowed Or Grown

The OpenSanctions ecosystem has solved a problem the substrate cannot avoid. It is called FollowTheMoney. It is a Python schema for investigative entities. Person, Company, Asset, Address, Payment, Ownership. Used by the International Consortium of Investigative Journalists, by OCCRP, by OpenSanctions itself. Permissively licensed. Genuinely active. Four parallel scans converged on it independently, which is the strongest signal the planning has produced.

But the schema is anti-corruption-shaped. It assumes financial-crime use cases. The mining investigation does not. A claim, a drill site, a parcel of bedrock, an exploration permit. These are physical objects that the FollowTheMoney ontology can be stretched to cover but does not name natively. The path of least resistance is to subclass the Asset type and let it carry mineral-specific properties. The path of most independence is to grow a vocabulary from zero.

The planning document sided gently with independence. Starting from our own data model is cheaper than translating from theirs. But Codex flagged that argument as the weakest in the document. The case for adopting the spine is that you get cross-investigation entity linking for free, with years of refinement behind it. The case against is that every cross-investigation database decision is then mediated through somebody else's concepts.

The question is whether the substrate inherits a vocabulary or invents one. Both paths are defensible. Only one can be taken.
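The borrowed path can be made concrete. What follows is a minimal sketch, assuming plain Python dataclasses as stand-ins. It does not use the real FollowTheMoney library, and every class name, field, and value here is hypothetical. It only shows the shape of the idea: a generic Asset type stretched to carry mineral-specific properties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    """Stand-in for a borrowed, generic asset type (hypothetical, not the library class)."""
    id: str
    name: str
    country: Optional[str] = None


@dataclass
class ExplorationPermit(Asset):
    """Mining-specific extension: the borrowed vocabulary stretched to cover a permit."""
    mineral: Optional[str] = None
    valid_from: Optional[str] = None  # ISO date string
    valid_to: Optional[str] = None    # ISO date string
    holder_id: Optional[str] = None   # id of the company entity holding the permit


# Placeholder values for illustration only.
permit = ExplorationPermit(
    id="permit-001",
    name="Example permit nr 1",
    country="se",
    mineral="example-mineral",
    holder_id="company-001",
)

# Code that only knows the borrowed vocabulary can still treat the permit as an Asset.
is_generic_asset = isinstance(permit, Asset)
```

The grown path would define ExplorationPermit without the Asset parent. The trade is visible in the last line: inheritance buys compatibility with tools that speak the borrowed vocabulary, at the price of fitting inside it.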

The AGPL Boundary

Datashare exists. It was built by the International Consortium of Investigative Journalists. It has full-text search, in-browser PDF rendering, entity browse, and a polished Vue user interface. The omega-two scan stood it up, and it ingested two hundred and twenty-five mining documents in seconds. The capability gap it closes for the substrate is real.

It is also licensed under the Affero General Public License, version three. The Affero clause, section thirteen, applies when a modified version of the software is offered to users who interact with it remotely over a network. Remote interaction does not mean public deployment. It means network access by any user. An internal tool with a private user interface is not exempt if it runs modified code.

The clean boundary is to run Datashare as a separate, unmodified service. We issue API calls. Nothing is lifted. Nothing is modified. The only work the license covers is Datashare itself. The substrate can embed Datashare in an iframe and let it do what it does.

The unclean boundary is to lift the viewer code, or to modify Datashare and expose the result over our user interface. That triggers the source-offer obligation on the combined work. For an internal tool that may someday graduate to a public deployment, that obligation is a future cost worth pricing.

The question is whether the substrate keeps Datashare alongside as a co-located document reader, or whether it builds the document layer itself. Both are coherent. The license boundary is what makes the question interesting.

Aleph And Its Successors

OpenAleph is the active fork of the Aleph platform. Aleph was the document and entity engine behind the Panama Papers and Pandora Papers investigations. OpenAleph is maintained by the Data and Research Center, based in Berlin. The license is permissive. The code is reachable. The architecture is investigative-journalism-shaped from day one.

The planning document rejected OpenAleph as the base of the substrate. The decisive reason was foreign maintenance surface. Adopting OpenAleph as the base means carrying a large upstream codebase whose architecture decisions are not ours to make and whose update cadence is not ours to set. For a tool whose value is its fit to Pär's workflow over years, that tax is real and ongoing.

But the document left open whether to lift code from OpenAleph piecemeal. Their entity-extraction pipeline. Their FollowTheMoney schema usage. Their document-ingest patterns. Their document viewer. Each is a candidate. Each is permissively licensed and individually liftable.

Codex flagged the risk. Lifting these pieces silently imports the architectural assumptions the foreign maintenance argument flagged. The assumptions are not visible at the function level. They compound at the cross-investigation layer.

The question is how the substrate relates to OpenAleph. Lift code. Lift patterns only. Run it alongside as a service. Or skip it entirely and consult only as reference. The answer probably differs by capability.

One Repo Or Many

The substrate sits as a sibling to arebladet two. Not inside it. Not a subdirectory. A peer. That much is decided.

What is not decided is the inside of the substrate itself. One Python project with a Svelte frontend inside. Several coordinated services with separate repositories. A single repository with multiple top-level packages. A polyglot stack with a Python core and a JavaScript user interface as a sibling project.

Running Datashare alongside already pushes the substrate into polyglot territory. The Java service is just there. The question is whether the substrate's own code follows the same pattern, or stays unified.

There are good arguments for both. One repository simplifies bootstrap and reduces cognitive overhead. Several repositories allow component isolation and independent versioning. The investigation substrate is meant to grow over years. The shape it starts with sets a default that is hard to undo.

The question is where the boundaries between projects go. The answer constrains every later question about deployment, versioning, and how the data layer exposes itself.

The Storage Question

PostGIS plus TimescaleDB is the boring correct answer to storing events with locations and times. The summary scan called it that. Store events as a geometry plus a timestamp. Make the events table a hypertable partitioned on time. Index the geometry with a spatial index. Polygon plus time queries become one line of SQL.

For a single journalist working through one investigation at a time, that may be heavier than the workload needs. SQLite plus flat files plus an in-process spatial library would handle Aura and Häggån's working corpus without breaking a sweat. The cost is paid in upgrade friction when the corpus eventually does grow.

There is also the property-graph question, which is adjacent but distinct. Kùzu is an embedded property graph with vector and full-text search built in. Apple acquired the company in twenty twenty five. The code remains permissively licensed. For "show me all entities connected to company X across investigations," Cypher beats SQL. For everything else, Postgres beats Kùzu by inertia.

The question is how much storage infrastructure the minimum viable product needs. The Postgres-heavy answer trades early infrastructure for late simplicity. The SQLite answer trades early simplicity for a migration later. The property-graph question shadows both. None of them is wrong.
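The SQLite end of that trade can be sketched in a few lines. This is a minimal sketch, assuming Python's built-in sqlite3 module and a bounding box instead of a true polygon. The table layout, the function name events_in_box, and the sample rows are all made up for illustration. They are not gruvor's schema and not real events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        occurred_on TEXT NOT NULL,  -- ISO 8601 date, which sorts correctly as text
        lon REAL NOT NULL,
        lat REAL NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_events_time ON events (occurred_on)")

# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO events (title, occurred_on, lon, lat) VALUES (?, ?, ?, ?)",
    [
        ("permit granted", "2019-03-01", 14.50, 63.10),
        ("drilling reported", "2021-06-15", 14.55, 63.12),
        ("permit granted elsewhere", "2021-07-01", 18.00, 59.30),
    ],
)


def events_in_box(conn, west, south, east, north, start, end):
    """Return events inside a bounding box and a date range, oldest first."""
    return conn.execute(
        """
        SELECT title, occurred_on FROM events
        WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?
          AND occurred_on BETWEEN ? AND ?
        ORDER BY occurred_on
        """,
        (west, east, south, north, start, end),
    ).fetchall()


hits = events_in_box(conn, 14.0, 63.0, 15.0, 63.5, "2020-01-01", "2022-12-31")
```

The same question in PostGIS would swap the four BETWEEN bounds for a single spatial predicate against a stored polygon. That is the upgrade friction in miniature: the query changes shape, and so does every loader that writes coordinates.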

That closes Part One. Six questions. Each blocks something in the next planning document. The name blocks the first commit. The spine blocks the entity table. The Affero boundary blocks the document layer. OpenAleph's posture blocks the ingest layer. The project shape blocks deployment. The storage question blocks everything that touches data.

Part Two opens here. The same kind of question, sized differently. These do not block this week. They shape the substrate over the year that comes after.

Part Two. What Minimum Looks Like

Section four of the planning sequence has not been written yet. It is called minimum viable product scope. The constraint is sharp. The substrate must be sized to support starting Aura and Häggån research. Not to ship the Aura and Häggån article. The line is real and consequential.

What does that look like, concretely? A walkthrough mode for the first document of each new type. A scaffolded ingest path that captures structure on the second pass. A timeline view that accepts events from any source. A spatial filter that takes an area rather than a point. A document browse that lists things by entity, with one-sentence explainers. An incoming landing zone that accepts documents Pär does not yet know what to do with.

The temptation in writing the section will be to add. Article generation. Fact-check workflows. Source-cluster discovery. Press-question sequencing. Each one is genuinely useful. Each one is also article-ship work, not investigation-start work.

The planning's hard discipline is to keep capacity-filling out of section four. The substrate exists to start investigations, not to perform completeness. The question is what gets cut, not what gets added. The smallest useful version is also the one that earns its keep first.

The Same Person In Two Investigations

This is the load-bearing question. The framing called it capability four, cross-investigation discovery via a connected database. The dream is a single tool with different article areas, where working on a completely different story finds a random connection to a mining thing. The substrate's value compounds over investigations the way gruvor's value did not.

For that to work, the substrate has to know when a company in the Aura investigation is the same company that appeared in some unrelated piece months earlier. Same name is not enough. Names drift. Subsidiaries restructure. Spelling varies. A person who appeared as a board member in one filing and a witness in another needs to collapse into one entity, or the connection is invisible.

Nomenklatura is the candidate. It is part of the OpenSanctions ecosystem, permissively licensed, very active. Its design is exactly this. A judgement store records human decisions about which entities are the same. Those judgements persist across runs and across new corpora. Splink is the heavier-weight alternative once the entity count climbs above a hundred thousand.

The question is what shape that persistence takes. Canonical identifiers assigned by a resolver. Content-derived fingerprints over name and country and identifier. Per-investigation local identifiers reconciled later. The data model has to support whichever shape is chosen, and the choice is not local. It propagates everywhere.
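Two of those shapes can be shown side by side. This is a minimal sketch, assuming only the Python standard library. It is not Nomenklatura's API. The functions normalize and fingerprint and the JudgementStore class are hypothetical names, and the company names are placeholders.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so spelling drift matters less."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def fingerprint(name: str, country: str, identifier: str = "") -> str:
    """Content-derived id: the same normalized inputs always give the same value."""
    key = "|".join([normalize(name), normalize(country), normalize(identifier)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


class JudgementStore:
    """Records human decisions that two fingerprints are the same entity (hypothetical)."""

    def __init__(self):
        self._merged_into = {}

    def canonical(self, fp: str) -> str:
        # Follow merge links until reaching an id that has not been merged away.
        while fp in self._merged_into:
            fp = self._merged_into[fp]
        return fp

    def merge(self, keep: str, absorb: str) -> None:
        root_keep, root_absorb = self.canonical(keep), self.canonical(absorb)
        if root_keep != root_absorb:
            self._merged_into[root_absorb] = root_keep


store = JudgementStore()
a = fingerprint("Example Mining AB", "SE")
b = fingerprint("EXAMPLE  MINING AB", "se")         # spelling drift, same fingerprint
c = fingerprint("Example Mining Aktiebolag", "SE")  # different text, different fingerprint

store.merge(a, c)  # a human judgement: these two are the same company
same_entity = store.canonical(b) == store.canonical(c)
```

The fingerprint catches drift that normalization can see. The judgement store catches what only a person can see. Because judgements are stored separately from the fingerprints, they survive a re-run over a new corpus, which is the persistence the question is about.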

Events Belong To Areas

Gruvor had a spatial filter. It took a point. The user dropped a pin on a map, and the system showed events within a radius of that pin. That is how a lot of mapping software works.

Aura and Häggån needs an area filter. Not a point. A polygon. Or a named region. The Viken cluster is not a place you put a pin on. It is a region with edges. The exploration history of the entire cluster is the chronology Pär needs to walk through.

The planning document flagged this as the unresolved part of capability one. Area-filtered event chronology requires a data model that joins events to typed places or areas without forcing point geometry. If the model is not specified, the substrate re-derives gruvor's point-only filtering by accident.

There are clean options. Typed place entities with associated bounding boxes. Named regions with stored polygon geometries. Events linked to places by foreign key, places linked to areas by polygon containment. The spatial primitives exist. The question is which one the data model picks first.

The wrong move is to add it later. Late spatial schema changes are expensive. The right move is to specify the events-to-areas shape before any loader writes a single event row.
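One of the clean options above, events linked to places and places linked to areas by containment, can be sketched without any spatial library. This is a minimal sketch under stated assumptions: the Area, Place, and Event classes are hypothetical, the geometry is a toy square, and the containment check is a plain ray-casting test rather than a database spatial index.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


@dataclass
class Area:
    name: str
    polygon: List[Point]  # ring of vertices; the last vertex connects back to the first


@dataclass
class Place:
    id: str
    name: str
    location: Point


@dataclass
class Event:
    title: str
    occurred_on: str  # ISO date
    place_id: str     # events reference places, never raw coordinates


def contains(polygon: List[Point], point: Point) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def chronology(area: Area, places: List[Place], events: List[Event]) -> List[Event]:
    """All events at places inside the area, oldest first."""
    in_area = {p.id for p in places if contains(area.polygon, p.location)}
    return sorted((e for e in events if e.place_id in in_area), key=lambda e: e.occurred_on)


# Toy data, for illustration only.
region = Area("example region", [(0, 0), (10, 0), (10, 10), (0, 10)])
places = [Place("p1", "site inside", (5, 5)), Place("p2", "site outside", (20, 5))]
events = [
    Event("drilling", "2021-05-01", "p1"),
    Event("unrelated filing", "2020-01-01", "p2"),
    Event("first permit", "2019-02-01", "p1"),
]
titles = [e.title for e in chronology(region, places, events)]
```

The point of the shape is the indirection. An event row never carries a pin. It carries a place, and the area question is asked of places, so the filter can become a polygon or a named region without touching a single loader.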

When The Loader Runs Itself

Gruvor had thirty hand-coded document loaders. Each one was a Python script that knew the shape of one document type and parsed it into the database. Maintaining thirty of them was a debt. Writing a thirty-first was a small panic.

The framing's capability two replaces that pattern. Per-document-type review user interface for ingest. Claude processes one sample of a new document type. The structured output and the original document are shown side by side in a user interface. Pär validates a small number of additional samples. When the agreement is high enough, the loader clears for autonomous running.

The shape is mostly clear. The implementation is mostly clear. What is unclear is how programmatic the loop actually gets. How many samples are enough to clear a loader? What happens when a document drifts and the loader starts failing silently? Does Pär get notified, or does the user interface re-open the loop for that document type?

The question is when the loader runs itself, and when Pär still has to look. Too aggressive and the substrate eats bad data. Too conservative and ingest still requires hand-holding for every type. Finding the level where Pär trusts the loader is the design work that has not yet started.
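The clearing rule itself is small enough to write down, even though its numbers are not decided. This is a minimal sketch. The LoaderReview class is hypothetical, and the thresholds of five samples and ninety percent agreement are assumptions chosen only to make the example run. The planning has not set either value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoaderReview:
    """Tracks human verdicts on one document type's loader (hypothetical design)."""
    doc_type: str
    min_samples: int = 5        # assumed value, not from the planning document
    min_agreement: float = 0.9  # assumed value, not from the planning document
    verdicts: List[bool] = field(default_factory=list)

    def record(self, approved: bool) -> None:
        """Store one verdict: did the reviewer accept the structured output as-is?"""
        self.verdicts.append(approved)

    @property
    def agreement(self) -> float:
        """Share of reviewed samples that were accepted; zero when nothing is reviewed."""
        return sum(self.verdicts) / len(self.verdicts) if self.verdicts else 0.0

    def cleared(self) -> bool:
        """The loader may run unattended only with enough samples and enough agreement."""
        return len(self.verdicts) >= self.min_samples and self.agreement >= self.min_agreement

    def reopen(self) -> None:
        """Drop prior verdicts, for example after drift is suspected, so review starts over."""
        self.verdicts.clear()


review = LoaderReview("example-document-type")
for _ in range(4):
    review.record(True)
cleared_after_four = review.cleared()  # too few samples so far

review.record(True)
cleared_after_five = review.cleared()  # five samples, all accepted

review.record(False)
cleared_after_reject = review.cleared()  # agreement drops to five out of six
```

The sketch leaves the hard part untouched. It says nothing about how drift is noticed, only what happens once somebody calls reopen. Too aggressive or too conservative lives entirely in two numbers that somebody still has to pick.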

The Sibling Becomes One

The substrate is modular toward arebladet two. Not built into arebladet two. The framing was explicit. Sibling repository, not subdirectory. Shared data layer eventually, but not now. Clean export interface from day one, so the merge can be grown into rather than retrofitted later.

That posture defers a question rather than answering it. When does the integration actually happen? What triggers it? Is it a data-layer merge where both projects read from the same store? Is it a service boundary where one project calls the other? Is it a graduation, where the substrate's mature pieces fold into arebladet two and the substrate itself dissolves?

The chatarkiv archive, the arebladet email archive, the published article corpus. These all sit elsewhere. The substrate's connected database would benefit from reaching into them. Eventually it will. The question is whether eventually means six months or three years.

The planning does not answer this. It cannot. The substrate has not yet produced enough investigations to know what the integration shape wants to be. The question stays open until the use case sharpens.

The Story That Has Not Arrived

The Aura and Häggån story has no clear shape yet. The X ninety two investigation was a single-company anomaly. A beneficial-owner thread. A dispens, an exemption, that looked routine until somebody actually compared durations. The whole article hung off one cluster of facts.

Aura and Häggån will not be that. The entire Viken area is covered in prospecting permits. The storyline is much longer. It involves more people, more companies, more time. The historical media coverage of the Viken cluster is source material for the investigation, not a footnote at the end.

The substrate cannot be designed against the last article's surprises. That is the warning the framing document repeats twice. Building toward the X ninety two pattern is the wrong shape. The next story's surprises are different by construction.

The question is when the Aura and Häggån storyline emerges, and from what. It might emerge from a single document, the way X ninety two emerged from the beneficial-owner filing. It might emerge from a pattern across many sources, the way deeper investigations sometimes do. It might emerge from a conversation, a tip, a tangent in an unrelated investigation.

The substrate's job is to make all three of those paths possible. The story will arrive when it arrives. The tool exists so that when it arrives, Pär is already looking in the right direction.

The Last Thing

Twelve questions. Six urgent, six bigger. Each one is the absence of a decision, and absences are not the same as gaps. A gap is something missing from a design. An absence is something the design has deliberately not closed yet, because closing it requires a kind of judgment the document cannot make on its own.

The planning's discipline so far has been to keep these absences honest. To name them. To not paper them over with confident-sounding answers. The acceptance signal at the bottom of each section asks whether anything important is missing and whether anything is included that Pär does not want. Both questions are real.

The next move is Pär's. Six of these answers shape section three of the planning sequence, which is the per-component scorecard against the load-bearing capabilities. The other six wait for section four and beyond.

There is no rush. The substrate exists to do work that pays off over years. The cost of getting these questions wrong is measured in those years. The cost of taking time on them is measured in days.

Before the first commit, twelve doors. Pär picks which one opens first.