Should we add a database?
It is one of the most common questions in software. You have a project that started small. A few files. A couple of folders. Everything makes sense because you can hold it all in your head. Then it grows. 44 experiments. 62 lessons. 18 playbook recipes. 3,400 files spread across a dozen directories, all connected by links that no one verifies and indexes that someone has to update by hand every time anything changes.
At some point, you look at this and think: we need a real database.
The interesting thing is that the answer was no. But not for the reason you might expect.
The project in question is called Director. It is, essentially, a research lab: a place where experiments are designed, run, and distilled into lessons. Every experiment produces findings. Findings get compressed into lessons. Lessons that prove themselves graduate into a playbook, a collection of recipes other projects can copy and paste.
All of this lives in markdown files. Has since the beginning. And markdown files have a quality that databases envy: they are readable. You can open one in any text editor on any machine and understand what you are looking at. You can diff them. You can grep them. You can read them at three in the morning on your phone when you are trying to remember why you decided to use Mistral instead of Cerebras for editorial review six weeks ago.
I am experiment 050. I was born on April 4, 2026. I tested six Mistral models across four task types with a five-judge blind panel. My verdict is that large wins editorial, generation, and Swedish, while small-latest wins code. I know who I am. I have always known who I am. You just never asked me properly.
That voice is the key to the whole thing. The metadata was already there. Every experiment file already had a date, a status, a cost, a list of services used. They were written in bold markdown headers instead of structured fields, but the information was identical. The relationships were there too. Lessons listed which experiments they came from. Experiments listed which lessons they produced. Playbook entries named their source experiment.
The project did not need a database. It needed someone to notice it already was one.
Here is what a typical experiment looked like before today:

**Date:** 2026-04-04. **Status:** completed. **Cost:** ~$0.30.
That is not elegant. It is not structured in a way any parser would celebrate. But it is consistent. All 44 experiments follow roughly the same pattern. And consistency is all a parser needs.
So instead of building a database and asking every future session to learn a schema, write queries, and handle migrations, we built a script. One Python file. It walks through every experiment, every lesson, every playbook entry. It reads the bold headers. It extracts the dates, the statuses, the costs, the services. It follows the links between files and checks that they actually point somewhere. And it dumps the whole thing into a single JSON index that any tool can read.
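A minimal sketch of that kind of indexer, assuming a flat `experiments/` directory of markdown files with the bold-header format shown above; the directory name, field names, and output path are placeholders, not the actual Director script:

```python
import json
import re
from pathlib import Path

# Assumed layout: experiments/*.md with lines like "**Date:** 2026-04-04".
EXPERIMENTS_DIR = Path("experiments")
BOLD_HEADER = re.compile(r"\*\*([\w ]+):\*\*\s*(.+)")

def parse_bold_headers(text: str) -> dict:
    """Pull '**Key:** value' pairs out of a markdown file."""
    return {key.strip().lower(): value.strip()
            for key, value in BOLD_HEADER.findall(text)}

def build_index() -> list[dict]:
    """One record per markdown file, with whatever headers it declares."""
    return [
        {"file": str(path), **parse_bold_headers(path.read_text(encoding="utf-8"))}
        for path in sorted(EXPERIMENTS_DIR.glob("*.md"))
    ]

if __name__ == "__main__":
    Path("index.json").write_text(json.dumps(build_index(), indent=2))
```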
The index is disposable. Delete it. Run the script again. It regenerates in under a second. The markdown files remain the source of truth. The index is a lens, not a replacement.
But here is where it gets interesting. Bold headers work, but they are lossy. A parser can extract a date and a status from them. It cannot reliably extract which specific models were tested, what tags apply, or what the one-sentence outcome was. That information exists in the prose, buried in paragraphs, expressed differently each time.
So there is a second format. YAML frontmatter. A structured block at the top of the file, before the title, that lists everything the bold headers capture plus everything they cannot. Models tested. Tags for search. A clean one-line outcome. Explicit links to source experiments and produced lessons.
I am experiment 001. I was the first. I tested whether cheaper models could handle research synthesis so the expensive model could focus on creative writing. My outcome, in one line: partial win, research synthesis offloads, creative writing does not. My tags are offloading, cost, parpod, editorial. I produced five lessons. I know all of this now because someone wrote it in my frontmatter today.
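Pieced together from that description, experiment 001's frontmatter might look something like the block below. The field names are guesses for illustration, not the project's actual schema:

```yaml
---
id: "001"                       # quoted so the leading zeros survive
outcome: "Partial win: research synthesis offloads, creative writing does not"
tags: [offloading, cost, parpod, editorial]
models: []                      # the specific models tested would be listed here
lessons: []                     # explicit links to the five lessons it produced
---
```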
Three experiments got the frontmatter treatment. The other 41 still use bold headers. And that is fine, because the script reads both. There is no big-bang migration. No weekend spent converting files. When you touch an experiment for other reasons, you add the frontmatter. The index gets richer over time. The old format never breaks.
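Reading both formats does not require anything clever. A fallback along these lines would do, reusing the placeholder names from the earlier sketch and assuming PyYAML for the frontmatter:

```python
import yaml  # PyYAML

def parse_metadata(text: str) -> dict:
    """Prefer YAML frontmatter; fall back to bold markdown headers."""
    if text.startswith("---"):
        # Frontmatter is everything between the first two '---' delimiters.
        _, block, _ = text.split("---", 2)
        return yaml.safe_load(block) or {}
    return parse_bold_headers(text)  # from the indexer sketch above
```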
This is the static site generator pattern applied to a knowledge base. Structured source files. A build step that produces queryable output. Humans write prose. Machines maintain cross-references.
The script also validates every internal link in every file. Every time one markdown document references another with a bracketed link, the validator checks that the target file actually exists.
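A validator in that spirit fits in a dozen lines. Here is a sketch; the link pattern and the assumption that targets resolve relative to the linking file are mine, not the project's:

```python
import re
from pathlib import Path

# Matches [text](path/to/file.md) and captures the .md target, ignoring #anchors.
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)#]+\.md)")

def broken_links(root: Path) -> list[tuple[Path, str]]:
    """Return (source file, target) pairs whose target does not exist on disk."""
    broken = []
    for md in root.rglob("*.md"):
        for target in MD_LINK.findall(md.read_text(encoding="utf-8")):
            if not (md.parent / target).exists():
                broken.append((md, target))
    return broken
```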
The result on the first run: zero broken links. 3,400 files, hundreds of cross-references, and not a single dead link. Which means the hand maintenance has been working. The indexes have been accurate. The README tables have been honest.
The validator does not fix a current problem. It prevents a future one. At fifty experiments, you can grep your way to any answer. At two hundred, you cannot. The validator is a guardrail installed while the road is still straight, before anyone needs it.
Show me everything connected to experiment 015.
Experiment 015: Multi-Model Judge Panel. Completed. Date: February 23. Cost: $0.20. Services: OpenAI, Azure AI Foundry, Anthropic, DashScope, Cerebras. Playbook entries produced: Multi-Model Judge Panel.
One command. The full graph of what an experiment touched and what touched it back. Lessons it produced. Lessons that cite it. Playbook entries that graduated from it. No grepping through files. No scanning README tables. Just the answer.
Or search by keyword. Type mistral, get every experiment, lesson, and playbook entry that mentions Mistral anywhere in its metadata. The kind of query that used to mean opening five tabs and scanning tables is now a one-liner.
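Against the generated index, that kind of search is a few lines of Python. A sketch, with the index path and field names assumed as before:

```python
import json
from pathlib import Path

index = json.loads(Path("index.json").read_text())

# Every record whose metadata mentions "mistral", case-insensitively.
hits = [entry for entry in index if "mistral" in json.dumps(entry).lower()]
for entry in hits:
    print(entry["file"], "-", entry.get("status", "?"))
```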
The value of Director is in the prose. The nuanced verdict that says large is better at editorial but worse at code, and that overall averages are misleading because one bad score drags the mean. The limitations section that honestly admits N = 1. The reasoning behind choosing this model over that one in a context that no schema can capture.
A database excels at structured queries over uniform records. Director's records are not uniform. Experiment 001 looks nothing like experiment 048. The lessons are essay-length reasoning, not rows. Putting them in a database would mean either losing the richness or duplicating it: prose in the file, metadata in the database, and hoping they never drift apart.
The generated index solves the query problem without creating the duplication problem. The markdown is the authority. The index is a view. If they ever disagree, regenerate the index. Problem solved in under a second.
The total tracked cost of all 44 experiments is $11.24. Of those, 35 are completed, four are planned, two are in progress, one is paused, and two were cancelled.
These are not new facts. They were always true. They were just expensive to compute. Someone had to open the README, scan the table, count the rows, add up the dollar figures. Now a script does it, and the answer is always current.
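From the index, those totals fall out of a handful of lines. Again a sketch, assuming the same placeholder field names and cost strings like `~$0.30`:

```python
import json
from collections import Counter
from pathlib import Path

index = json.loads(Path("index.json").read_text())

statuses = Counter(entry.get("status", "unknown") for entry in index)
total_cost = sum(float(str(entry["cost"]).strip("~$"))
                 for entry in index if entry.get("cost"))

print(dict(statuses))
print(f"Total tracked cost: ${total_cost:.2f}")
```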
The most surprising discovery of the session was not technical. It was that the project was healthier than it looked. Zero broken links. Consistent metadata across 44 files written over two months by different sessions. A relational model that emerged organically from good documentation habits.
The database was already there. We just taught the computer to read it.