The Half Tablet of Edronax: What Happens When You Lock In Before You Look

Opening

Half a tablet of Edronax. That is what the prescription said. A real medication, a real dose, written by a real doctor, for a real patient, in a real Swedish psychiatric clinic. When the optical character recognition engine read that prescription off the scanned page, it saw seventy-two tablets.

Seventy-two tablets of Edronax. That is not a clerical error. That is, depending on the reader, a death sentence or a punchline. The database accepted it without complaint. We did not notice for three weeks.

This episode is about why.

What we thought we built

Fourteen hundred and seventy pages of scanned medical records. A complete psychiatric case file from a hospital in southern Sweden, belonging to a real woman, gathered for what is in practice a legal investigation into how the system treated her and her family. We pulled the pages through an open source optical character recognition engine, dropped the text into a database, and built eight scripts that turned that text into something queryable.

By the end we had two thousand nine hundred and four entries, eleven hundred and twenty-one care episodes, three thousand seven hundred and forty-six person references, three hundred and two drug mentions. It looked like a finished thing. It had row counts. It had foreign keys connecting tables to tables in the proper way. It had an extracted vocabulary, with eighty-three percent of entries reporting structured fields, which is the kind of percentage that makes you feel competent.

It was not a finished thing. It was a finished-looking thing built on top of an unfinished understanding.

Then we tried to give it to her.

The export that broke us

She is going to do her own analysis. Her own questions, her own tooling, her own pace. We needed to get her a clean copy. So we built an export. Twelve representative entries, rendered as readable text, the kind of thing a human could glance at and check.

That is when we saw the half tablet of Edronax becoming seventy-two tablets. And fifty milligrams per milliliter of Suxametonium, the muscle relaxant they use in electroconvulsive therapy, which the engine had read as the letter s, the letter o, lowercase m, lowercase g. Five oh em gee had become ess oh em gee. And the alcohol breathalyser reading, zero comma zero zero promille, which had become zero nine hundred. Zero comma zero zero is continental European notation, where the comma is the decimal separator. The comma disappeared, the decimal went with it, and the breathalyser was now reporting something a breathalyser cannot physically report.

Tesseract is the open source workhorse of optical character recognition. It is forty years old in some lineages. It is excellent at narrative prose. We had chosen it because it was there, because it was fast, because it ran in forty-three minutes for fourteen hundred and seventy pages on a Mac mini. The choice was reasonable in every dimension except the one we had not measured. We had never asked Tesseract to read a fraction. We had never asked it to read a value with a comma decimal. We had never sat next to it and watched it work on a form-shaped page where the labels and the values are in separate boxes, columnar, and the engine does not preserve the column relationship. It linearises. It takes a two-column form and reads it as one stream of words, in whatever order the geometry happens to dictate, and what comes out the other side is not wrong, exactly. It is geometry-collapsed. The labels and the values are still there. They are no longer in the same neighborhood.
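
To make geometry-collapsed concrete, here is a minimal sketch in Python. The box coordinates, the labels Drug and Dose, and the function names are all invented for illustration. This is not how Tesseract works internally, and real engine output is messier. It only shows how the same four boxes can come out as one stream, in one possible reading order, or be paired by the row they sit on.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int      # left edge of the word box, in pixels (hypothetical)
    y: int      # top edge of the word box, in pixels (hypothetical)
    text: str


# A two-column form: labels at x=0, values at x=200, two rows.
boxes = [
    Box(0, 0, "Drug"),
    Box(0, 20, "Dose"),
    Box(200, 0, "Edronax"),
    Box(200, 20, "1/2 tablet"),
]


def linearise(boxes):
    """Read the page as one stream: left column top to bottom, then the next column."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.x, b.y))]


def pair_by_row(boxes, label_x=0, tolerance=5):
    """Pair each label with the value box that sits on (roughly) the same row."""
    labels = [b for b in boxes if b.x == label_x]
    values = [b for b in boxes if b.x != label_x]
    pairs = {}
    for label in labels:
        for value in values:
            if abs(value.y - label.y) <= tolerance:
                pairs[label.text] = value.text
    return pairs


stream = linearise(boxes)
pairs = pair_by_row(boxes)
print(stream)
print(pairs)
```

In the stream, every word survives but Dose is no longer next to its value. In the paired version, the neighborhood is kept.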

We had built a parser that read the linearised stream line by line, looking for known field labels. The parser was beautiful. It was wrong-shaped for half of the document.
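
A toy version of that kind of parser shows the failure. This is not our actual code. The label set and the sample lines are made up, and the only assumption is the general shape: walk the lines, look for a known label, take whatever follows on the same line.

```python
# Hypothetical label vocabulary, invented for this illustration.
KNOWN_LABELS = {"Drug", "Dose"}


def parse_line_by_line(lines):
    """Pair each known label with the rest of its own line."""
    fields = {}
    for line in lines:
        label, _, value = line.partition(":")
        label = label.strip()
        if label in KNOWN_LABELS:
            fields[label] = value.strip()
    return fields


# Narrative-shaped page: label and value share a line.
narrative = ["Drug: Edronax", "Dose: 1/2 tablet"]

# Form-shaped page after linearisation: labels first, values later.
linearised = ["Drug:", "Dose:", "Edronax", "1/2 tablet"]

from_narrative = parse_line_by_line(narrative)
from_form = parse_line_by_line(linearised)
print(from_narrative)
print(from_form)
```

On the narrative lines the parser fills both fields. On the linearised form it finds both labels and attaches nothing to either, and it does so without raising a single error.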

[sigh] We had not looked at the document.

The deeper problem was not the engine

It would be easy to make this episode about the optical character recognition. The dosing error is the kind of thing an audience remembers. But the engine is replaceable. There are newer ones now, layout-aware, written for documents specifically. They preserve box relationships. They emit structured output. We will use one. The half tablet of Edronax will, in the rebuild, stay half a tablet.

The deeper problem is that nothing in our pipeline was set up to notice the seventy-two tablets in the first place. We had no validation step. No spot check. No human in the loop until the very end. We had row counts and we trusted them. We had eighty-three percent vocabulary fill and we trusted it. We had built a thing and we had not looked at the thing.
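
For a sense of how small a validation step could have been, here is a hedged sketch. The limits are placeholders invented for the example, not clinical guidance, and the function names are hypothetical. The point is only that a check this crude would have flagged both the seventy-two tablets and the zero nine hundred.

```python
import re

# Placeholder plausibility limits, invented for illustration only.
MAX_TABLETS_PER_DOSE = 10
MAX_PROMILLE = 5.0


def check_tablet_count(raw):
    """Return warnings for a tablet count such as '1/2', '1,5' or '72'."""
    text = raw.strip()
    if re.fullmatch(r"\d+/\d+", text):
        numerator, denominator = map(int, text.split("/"))
        if denominator == 0:
            return [f"fraction with zero denominator: {raw!r}"]
        value = numerator / denominator
    elif re.fullmatch(r"\d+(?:[.,]\d+)?", text):
        # Accept both the comma and the point as decimal separator.
        value = float(text.replace(",", "."))
    else:
        return [f"unparseable tablet count: {raw!r}"]
    if value > MAX_TABLETS_PER_DOSE:
        return [f"implausibly large tablet count: {raw!r}"]
    return []


def check_promille(raw):
    """Return warnings for a breathalyser reading such as '0,00'."""
    text = raw.strip()
    if not re.fullmatch(r"\d+[.,]\d+", text):
        return [f"promille reading without a decimal separator: {raw!r}"]
    value = float(text.replace(",", "."))
    if value > MAX_PROMILLE:
        return [f"implausibly high promille reading: {raw!r}"]
    return []


print(check_tablet_count("1/2"))
print(check_tablet_count("72"))
print(check_promille("0,00"))
print(check_promille("0900"))
```

A check like this does not prove a value is right. It only makes an implausible one loud enough for a human to go and look at the page.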

And then, when we did look, when we said let us re-run the structured extractor with a slightly improved vocabulary on the canonical database, we found we could not.

This is where we have to talk about foreign keys.

A foreign key is a promise

You know this part now. A foreign key is a column that points at another table by its primary key, and a properly defined foreign key is a constraint, which means the database will refuse to let you violate the relationship. You cannot insert a row that points at nothing. You cannot delete a row that other rows depend on, unless you tell the database in advance what should happen to those other rows.
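
Here is that refusal in miniature. The sketch uses SQLite rather than DuckDB, only because SQLite ships with Python and the example stays self-contained. The table and column names are assumptions for illustration, not our real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE section (section_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    """
    CREATE TABLE entry (
        entry_id   INTEGER PRIMARY KEY,
        section_id INTEGER NOT NULL REFERENCES section(section_id),
        body       TEXT
    )
    """
)
conn.execute("INSERT INTO section VALUES (1, 'Medication')")
conn.execute("INSERT INTO entry VALUES (1, 1, 'Edronax 1/2 tablet')")


def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"refused: {exc}"


# A row that points at a section which does not exist.
orphan_insert = attempt("INSERT INTO entry VALUES (2, 99, 'points at nothing')")
# A parent row that another row still depends on.
parent_delete = attempt("DELETE FROM section WHERE section_id = 1")

print(orphan_insert)
print(parent_delete)
```

Both statements are refused. Neither the orphan row nor the deletion reaches the data.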

This is genuinely good. In a finished system, with stable data and stable queries and stable assumptions, foreign keys are free insurance. They prevent the entire category of bug where your booking points at a customer who no longer exists. They turn silent corruption into loud errors. They are the boring discipline that keeps the schema honest.

But there is a thing the explainer did not quite make explicit, and it is the thing that matters here. A foreign key is a promise about what you know.

When you say, the entry table refers to the section table by section identifier, and the database should enforce this, you are making a claim. You are claiming that you understand the relationship between entries and sections well enough to lock it in. You are claiming that your model of the data is accurate enough that the database can refuse, on your behalf, any operation that violates the model. The database is not the one making the promise. You are. The database is the enforcer.

If your model is right, the enforcement protects you from bugs. If your model is wrong, the enforcement protects the bugs from you.

The DuckDB stickler

We were using DuckDB, which is a wonderful database for analytics work, fast and modern and pleasant. DuckDB is also, on this particular point, stricter than most other databases. When you say you want to update or delete a row that other rows depend on through a foreign key, DuckDB will refuse, even if you are deleting the dependent rows in the same transaction, in the right order, atomically. It treats the dependency as load-bearing for the duration of the operation.

This is not a bug. It is a defensible choice about what a transaction means. But it has a practical consequence, which is that once you have built children on top of a parent table, that parent table becomes harder to change than you might expect.

The error, more or less, says that the row is still referenced by a foreign key in another table, and it says so in the register databases use. Polite. Precise. Immovable.

We hit that error three times. Each time was the same shape and we did not learn the lesson the first two times.
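
To be concrete about what in the right order, atomically means, here is the sequence written out. This sketch runs against SQLite, not DuckDB, so it does not reproduce the refusal. It shows the operation we expected to be allowed: children first, then the parent, inside one transaction. The table names are invented.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN and COMMIT below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER NOT NULL REFERENCES parent(id))"
)
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (10, 1)")

conn.execute("BEGIN")
conn.execute("DELETE FROM child WHERE parent_id = 1")   # dependents first
conn.execute("DELETE FROM parent WHERE id = 1")         # then the parent
conn.execute("COMMIT")

remaining = conn.execute(
    "SELECT (SELECT COUNT(*) FROM parent), (SELECT COUNT(*) FROM child)"
).fetchone()
print(remaining)
```

SQLite lets this through, because by the time the parent is deleted nothing points at it any more. The DuckDB behaviour we ran into treats the dependency as still standing for the duration of the transaction.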

Three foreign keys, three walls

The first time was when we tried to update a reference on the entries table, before we noticed the design problem. We thought, no problem, we will just update the parent. The database said no.

The second time was a circular reference we caught just before it landed. The entries table was going to point at the care-episodes table, and the care-episodes table was already pointing at the entries table for the start entry and the end entry of each episode. A reviewer caught it during the design review. Came in cold, had not seen the design before, said, this is circular, this will not work. If that review had not happened, we would have been stuck in a loop where neither parent could be modified, because each was the other's parent. The reviewer earned its lunch that day.
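
The check the reviewer did by eye can also be done mechanically. This is a minimal sketch, assuming the planned foreign keys are written down as a plain dictionary from each table to the tables it references. The table names are stand-ins and the function is hypothetical, not something from our pipeline.

```python
def find_cycle(references):
    """Return one cycle of table names as a list, or None if there is none.

    `references` maps each table to the tables its foreign keys point at.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(table):
        state[table] = GREY
        path.append(table)
        for target in references.get(table, ()):
            if state.get(target, WHITE) == GREY:
                # Found a reference back into the path we are walking.
                return path[path.index(target):] + [target]
            if state.get(target, WHITE) == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[table] = BLACK
        return None

    for table in references:
        if state.get(table, WHITE) == WHITE:
            found = visit(table)
            if found:
                return found
    return None


# Care episodes already point at entries (start entry and end entry).
existing = {"care_episode": ["entry"], "entry": []}
# The proposed change would have made entries point back at care episodes.
proposed = {"care_episode": ["entry"], "entry": ["care_episode"]}

print(find_cycle(existing))
print(find_cycle(proposed))
```

The existing design comes back clean. The proposed one comes back with the loop spelled out, which is roughly what the reviewer said out loud.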

The third time was in the very last session. We had improved the vocabulary, added forty new field labels we had discovered by reading the corpus more carefully. We shipped the updated extractor to the canonical database. We tried to run it. The database said no. Children depending on the parent. Same error. Same shape. Different week.

Three different times, the foreign keys we had added in the name of correctness prevented us from correcting the data. The constraint that was supposed to keep bad data out was now keeping the corrections out.

This is the thing I want you to feel in your gut, because it is the thing the explainer could not quite say. A foreign key is a commitment to a model of your data. If you are still figuring out the model, you have committed early. The shape of the commitment is this: every time you want to revise the model, you have to fight the constraints you put in place to enforce the previous version of it.

What rebuildable actually means

Here is the principle the rebuild is being designed around, and it is worth saying plainly because it is the kind of thing that sounds like jargon when people say it offhand.

Some data is expensive to produce. The optical character recognition is expensive. Forty-three minutes per pass, fourteen hundred and seventy pages, real CPU time, real money on a real cloud bill if you scale it up. You do not want to throw that away. You want to do it once and keep the result. That data is precious.

Some data is cheap to produce. The structured extraction, the field parsing, the person resolution, the care-episode clustering, all of that runs in seconds or minutes against the existing engine output. If you change your mind about how to extract fields, you do not want to be stopped. You want to drop the table and rebuild it. That data is rebuildable.

So the rebuild has two layers. The bottom layer holds the engine output, write-once, never updated, precious. The top layer holds everything derived. And there are no foreign keys crossing the boundary between them.
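
A sketch of the two layers, under stated assumptions: SQLite stands in for the real database so the example runs anywhere Python does, and the table names, the sample text, and the label sets are all invented. The rebuild is still being designed, so this is an illustration of the principle, not the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bottom layer: engine output. Written once, never updated.
conn.execute(
    "CREATE TABLE ocr_page ("
    "page_no INTEGER PRIMARY KEY, engine TEXT NOT NULL, text TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO ocr_page VALUES (?, ?, ?)",
    [(1, "tesseract", "Dose: 1/2 tablet"), (2, "tesseract", "Note: patient calm")],
)
conn.commit()


def rebuild_derived(conn, labels):
    """Drop the derived table and rebuild it from the bottom layer."""
    conn.execute("DROP TABLE IF EXISTS field")
    # page_no is a plain column here: deliberately no REFERENCES clause.
    conn.execute("CREATE TABLE field (page_no INTEGER, label TEXT, value TEXT)")
    pages = conn.execute("SELECT page_no, text FROM ocr_page").fetchall()
    inserted = 0
    for page_no, text in pages:
        label, sep, value = text.partition(":")
        if sep and label.strip() in labels:
            conn.execute(
                "INSERT INTO field VALUES (?, ?, ?)",
                (page_no, label.strip(), value.strip()),
            )
            inserted += 1
    conn.commit()
    return inserted


first_pass = rebuild_derived(conn, {"Dose"})            # first guess at the vocabulary
second_pass = rebuild_derived(conn, {"Dose", "Note"})   # improved vocabulary, full rebuild
raw_pages = len(conn.execute("SELECT page_no FROM ocr_page").fetchall())

print(first_pass, second_pass, raw_pages)
```

The second pass throws away everything the first pass produced and nothing objects. The bottom layer is read both times and written neither time.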

The reason there are no foreign keys is not that foreign keys are bad. It is that the top layer, the derived layer, needs to be free to be wrong. We are still figuring out how to read these documents. We will get the parser wrong. We will rebuild it. We will try again. The constraint that would otherwise refuse to let us drop and rebuild is, in this layer, working against the entire purpose of the layer.

In the bottom layer, when the engine output is stable and trusted and the model of pages, documents, and engines is settled, foreign keys are the free insurance the explainer described. In the top layer, where the model is what we are trying to discover, the foreign keys are bets placed before the race starts, and we have learned not to make those bets.

The lesson, said as plainly as I can

We added the protections meant for finished data to data we were still trying to understand. The protections worked. They protected the data from us. They protected our incomplete model of the data from being corrected. They protected the misreading of half a tablet of Edronax as seventy-two tablets, because by the time we noticed, we could not rerun the parser without first dismantling the foreign keys we had built around it.

The schema looked correct. The integrity was being enforced. Every row pointed at a real other row. The numbers added up. And the document, the actual paper trail of a real woman's psychiatric care, was being misread in clinically meaningful ways, and the system was perfectly happy.

If there is a lesson worth taking from this, it is not really about DuckDB, and it is not really about Tesseract either. It is about when to make promises, and when not to.

You add foreign keys when you trust your model of the data. You earn that trust by reading the data. By looking at fifty random rows. By spot checking the dosing column for fractions. By rendering twelve entries as readable text and reading them like a human would, before you build anything else on top of them. Trust is downstream of inspection.
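
Inspection of that kind does not need much machinery. A minimal sketch, assuming the extracted rows are available as a list of dictionaries. The field names, the sample rows, and the function names are all hypothetical.

```python
import random


def sample_for_review(rows, k=50, seed=None):
    """Pick up to k random rows for a human to read."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def render(row):
    """Render one row as a line a person can glance at and check."""
    return f"page {row['page']}: {row['drug']}, {row['dose']}"


def doses_with_fractions(rows):
    """Return the rows whose dose text still contains a fraction."""
    return [row for row in rows if "/" in row["dose"]]


# Invented rows standing in for an extracted dosing table.
rows = [{"page": n, "drug": "Edronax", "dose": "72 tablets"} for n in range(1, 201)]

picked = sample_for_review(rows, k=50, seed=1)
for row in picked[:3]:
    print(render(row))

fractions = doses_with_fractions(rows)
print(len(fractions), "doses contain a fraction")
```

A dosing column with no fractions in it at all is not proof of an error. It is a reason to open the scanned page and compare.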

We built the schema first and inspected last, which is the wrong order, and the schema we built was not wrong in any individual line. It was wrong in its timing. It locked in our understanding before our understanding had been earned.

Next time, the inspection comes first. The foreign keys come last. The half tablet of Edronax stays half a tablet. And the database, which is allowed to be a stickler, stickles for the right things.