PärPod Tech
Databases: The Ones Under Your Code
Episode 2 · 1h 11m · Apr 08, 2026
D. Richard Hipp built SQLite on a warship in 2000—now it runs on billions of devices, including 18 instances on Par Boman's VPS in Paris, silently executing the vision Edgar Codd described in thirteen pages of pure mathematics.

Databases: The Ones Under Your Code

Introduction: Previously, on Tables All the Way Down

Last time, we traced the history of databases from filing cabinets to filing lawsuits. We met Edgar Codd, the British mathematician who turned data into set theory in thirteen pages. We watched Larry Ellison read IBM's published research and beat them to market with Oracle, a product so aggressively sold that it shipped as Version Two because Version One sounded too risky. We watched Oracle's president tell financial analysts, in those exact words, that customers have no choice but to pay. We watched Monty Widenius name databases after his daughters and then fork his own creation when Oracle took it. We left off in approximately the year two thousand, with SQL everywhere and the relational model having won so completely that most developers had stopped thinking about it entirely.

And then Google published a paper. And Amazon published a paper. And a generation of engineers read those papers and decided that everything they knew about databases was wrong. Some of them were right. Most of them were not. But they all built things, and the things they built are still running, and some of those things are running on Par Boman's VPS in Paris right now, which means this is also his story, whether he likes it or not.

This is part two. The databases under your code.

The Little Database in a Warship

In the spring of two thousand, a software consultant named D. Richard Hipp was working at Bath Iron Works in Bath, Maine. Bath Iron Works is a subsidiary of General Dynamics. They build warships. Specifically, Hipp was writing software for the USS Oscar Austin, a guided missile destroyer, hull number DDG seventy-nine. The software's job was straightforward. Track the state of every pipe and valve in the damage control system. If the ship takes a hit, the crew needs to know instantly which sections can be sealed, which valves need closing, which pipes are compromised. This is the kind of application where reliability is not a feature request. It is the difference between a ship that survives and a ship that sinks.

The database backing this system was Informix, a commercial relational database made by a company that would later be acquired by IBM. Informix ran on a separate server. The application talked to Informix over a network connection. This arrangement worked perfectly, except for the times when it did not work at all. When the Informix server went down, which servers do, Hipp's application would display a dialog box. The dialog box said, in the helpful manner of software that has given up, "cannot connect to database server."

This dialog box appeared on a warship. In a damage control system. The software that was supposed to tell the crew which valves to close in an emergency was instead telling them that it could not reach a computer on the other side of the ship. And because Hipp was the contractor whose code painted that dialog box on the screen, Hipp got blamed. Not Informix. Not the network. Not the architecture that required a separate database server for an application running on a single machine. Hipp. Because his name was on the code that said the words.

Hipp is a careful person. Born April ninth, nineteen sixty-one, in Charlotte, North Carolina. Master's degree in electrical engineering from Georgia Tech in nineteen eighty-four. Worked at A T and T Bell Labs. Went back for a doctorate at Duke, finished in nineteen ninety-two under Alan Biermann. Ran a small consulting company called Hwaci, which stands for Hipp, Wyrick, and Company, named after himself and his wife Ginger Wyrick. Pronounced "hwah-chee." The kind of company that has a few employees, no investors, no debt, and makes its money by solving hard problems for people who have them.

Sitting on that warship, getting blamed for a database server he did not control going down at a moment that mattered, Hipp had an insight that would eventually put a database inside every phone, every browser, every operating system, and every airplane on earth.

I had this crazy idea that I'm going to build a database engine that does not have a server, that talks directly to disk, and ignores the data types. And if you asked any of the experts of the day, they would've said, that's impossible. That will never work. That's a stupid idea. Fortunately, I didn't know any experts, and so I did it anyway.

That is D. Richard Hipp describing the moment he decided to build SQLite. The name tells you everything. SQL. Lite. A lightweight implementation of SQL that runs inside your application, not beside it. No server. No network connection. No separate process. The database is a single file on disk. If the computer is healthy enough to run your program, the database is available. Full stop. No dialog boxes about connectivity. No dependency on a server that can fail independently of the application it serves.
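
The "no server, just a file" idea is easy to demonstrate with Python's bundled sqlite3 module, which embeds SQLite in the interpreter itself. A minimal sketch; the valve table is invented for illustration, not taken from any real damage control system:

```python
import os
import sqlite3
import tempfile

# The entire database is one ordinary file on disk. No server process,
# no network connection, no credentials, nothing that can be "down."
path = os.path.join(tempfile.mkdtemp(), "valves.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE valves (id INTEGER PRIMARY KEY, name TEXT, is_open INTEGER)")
con.execute("INSERT INTO valves (name, is_open) VALUES (?, ?)", ("fwd-coolant", 1))
con.commit()
con.close()

# Reopening the database is just reopening the file. If the machine is
# healthy enough to run this program, the data is available. Full stop.
con = sqlite3.connect(path)
row = con.execute("SELECT name, is_open FROM valves").fetchone()
print(row)  # ('fwd-coolant', 1)
con.close()
```

The library is in-process: every query above is a function call into code linked into the application, which is exactly why there is no "cannot connect" failure mode to report.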

The first version appeared in August of two thousand. It was released into the public domain. Not MIT license. Not Apache license. Not any license at all. The public domain. This was a deliberate choice, and the reason for it reveals something about Hipp's character that is worth understanding.

He had watched what happened to Berkeley DB, another embedded database. Sleepycat Software built it and maintained it for years as open source. Then Oracle acquired Sleepycat in two thousand and six. Overnight, the licensing terms changed. Projects that had depended on Berkeley DB's open source license suddenly had to deal with Oracle's licensing team, which, as we discussed at length in part one, is an experience roughly comparable to being audited by someone who already knows how much they want you to owe. Hipp did not want that to happen to SQLite. By placing it in the public domain, he made acquisition irrelevant. You cannot buy something that nobody owns. You cannot change the license on something that has no license. Any company that tried to make SQLite proprietary would find that anyone else could simply continue using the existing public domain code. The public domain is permanent. It is the one intellectual property state that no corporate action can undo.

Now here is the part that separates SQLite from almost everything else in the history of software. The testing.

SQLite's source code is approximately one hundred and fifty-five thousand eight hundred lines. The test suite that verifies that source code is approximately ninety-two million lines. That is not a typo. Ninety-two million. The ratio is five hundred and ninety to one. For every line of code that does something, there are five hundred and ninety lines of code that check whether that something was done correctly. The test suite achieves one hundred percent modified condition and decision coverage, which is a testing standard used for avionics software and medical devices. It means every branch, every condition, every logical path through the code has been tested. Every release since version three point six point seventeen in two thousand and nine has met this standard.

To put this in perspective, most well-tested commercial software achieves maybe seventy to eighty percent code coverage and considers that respectable. Aviation-grade MC/DC coverage is the standard that the software in your airplane's flight control system is held to. SQLite meets it voluntarily. For a free, public-domain database engine maintained by a small company in North Carolina. Because Richard Hipp decided that if his software was going to run inside damage control systems on warships, it should probably work.
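
What MC/DC actually demands is easier to feel with a tiny example. This sketch is purely illustrative (the function and its conditions are invented, not SQLite code): for a decision like "a and b", branch coverage is satisfied by two tests, but MC/DC requires showing that each condition, on its own, can flip the outcome.

```python
# A toy two-condition decision.
def should_alarm(pressure_high, backup_failed):
    return pressure_high and backup_failed

# Branch coverage only needs one True outcome and one False outcome:
assert should_alarm(True, True) is True
assert should_alarm(False, False) is False

# MC/DC needs three tests for "a and b": holding one condition fixed,
# flipping the other must flip the result, for each condition in turn.
assert should_alarm(True, True) is True     # baseline: both conditions true
assert should_alarm(False, True) is False   # pressure_high alone flips the result
assert should_alarm(True, False) is False   # backup_failed alone flips the result
```

Scale that discipline up to every condition in every decision across 155,800 lines of C, and the ninety-two million lines of test code start to make sense.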

And then it turned out that warships were just the beginning.

SQLite runs on every iPhone ever made. Every Android device. Every Mac. Every Windows ten and eleven machine. It is the storage engine inside Firefox, Chrome, and Safari. Skype uses it. Dropbox uses it. The Airbus A three fifty, one of the most advanced commercial aircraft in the world, runs SQLite in its avionics systems. The current estimate is that there are over one trillion active SQLite databases in the world right now. One trillion. That makes it, by a comfortable margin, the most deployed piece of software in human history. More than Linux. More than Windows. More than anything. Because it is inside everything.

When Airbus adopted SQLite for the A three fifty, they had a question. How long will you support this? The answer matters because the operational life of a commercial airframe is typically forty years. Airbus needed to know that the database inside their avionics would still receive bug fixes and security patches when the plane was still flying in twenty sixty.

Hwaci said yes. They published what they call the twenty fifty commitment, a promise to maintain backwards-compatible support for the SQLite file format and API through the year two thousand and fifty. This is a company of a handful of people making a support commitment that extends nearly a quarter of a century into the future for software that runs inside airplanes. They can make this commitment because of how the company is structured. Hwaci is small, private, has no investors, carries no debt, and is structured specifically to avoid acquisition. There is no venture capital fund that will force an exit. There is no board of directors that can sell the company to Oracle. The company exists to maintain SQLite, and it will continue to exist as long as Richard Hipp wants it to.

The funding comes from the SQLite Consortium. Members pay seventy-five thousand dollars per year or more for priority support, influence over the development roadmap, and the comfort of knowing that the database running inside their products is professionally maintained. Original members included Mozilla, Symbian, and Adobe. The consortium is supplemented by custom development contracts and professional support agreements. It is enough. More than enough. The company has never needed outside investment because the company has never tried to grow beyond what the product requires.

I wrote SQLite because it was useful to me and I released it into the public domain with the hope that it would be useful to others as well.

That is the whole mission statement. There is no pivot. No Series A. No plan to become a platform. A man built a thing that solved a problem he personally had, on a warship, and gave it to the world, and the world put it in everything. One trillion databases. Every pocket. Every dashboard. Every cockpit. And the test suite has five hundred and ninety lines of verification for every line of functionality, because D. Richard Hipp does not ship things that might display a dialog box that says "cannot connect to database server" at a moment when someone's life might depend on the answer.

The Elephant in Every Server Room

We told the origin story of PostgreSQL in part one. Stonebraker at Berkeley, INGRES in seventy-four, POSTGRES in eighty-six, Postgres ninety-five when Andrew Yu and Jolly Chen added SQL, PostgreSQL in ninety-six when the name got its final, unpronounceable form. We told you about the elephant mascot, Slonik, proposed in nineteen ninety-seven, because elephants never forget. What we did not tell you is what happened next, because what happened next is the reason PostgreSQL went from a respected academic database to the answer to almost every database question anyone asks in the twenty-twenties.

The key insight is the extension system. PostgreSQL does not just store data and let you query it. PostgreSQL lets you define entirely new data types, new operators, new index methods, and new query strategies as first-class objects within the database itself. This is not a plugin system bolted on as an afterthought. It is fundamental to the architecture. Stonebraker designed POSTGRES specifically to be extensible because he had learned from INGRES that a database that can only do what its creators anticipated will eventually run into problems its creators did not anticipate.

In two thousand and fourteen, PostgreSQL nine point four shipped with a feature called JSONB. Binary JSON with full indexing support. You could store a JSON document inside a PostgreSQL column, create indexes on fields nested three levels deep inside that document, and query those fields with full SQL join capabilities against your regular relational tables. In a single query, you could join a traditional table of customers with a JSONB column containing their preferences and filter on a nested field called notification dot frequency dot weekly. The query planner would use the index. It would be fast.
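
The shape of that query is worth seeing. PostgreSQL's own JSONB operators (->, ->>, @>) will not run without a PostgreSQL server, so this sketch uses SQLite's json_extract function, which Python's bundled sqlite3 can usually execute, purely to show the pattern: a relational column filtered against a field nested three levels inside a stored JSON document. Table and field names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, prefs TEXT)")

# The preferences live inside a JSON document, not in dedicated columns.
con.execute(
    "INSERT INTO customers VALUES (1, 'Ada', ?)",
    ('{"notification": {"frequency": {"weekly": true}}}',),
)

# Filter on a field nested three levels deep, in ordinary SQL.
rows = con.execute(
    "SELECT name FROM customers"
    " WHERE json_extract(prefs, '$.notification.frequency.weekly') = 1"
).fetchall()
print(rows)  # [('Ada',)]
```

In PostgreSQL the equivalent predicate can be backed by a GIN index on the JSONB column, which is the part that made JSONB a serious answer rather than a party trick: the document is flexible, but the query planner still gets an index to use.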

This mattered enormously because of what was happening in the wider database world at the time, which we will get to shortly. But the short version is this. A very large number of developers had adopted MongoDB specifically because it let them store JSON documents without defining a schema first. When PostgreSQL added JSONB, it could do the same thing, but inside a real relational database, with ACID transactions, with joins, with the entire SQL language available, and without the exciting surprise of discovering that your data was inconsistent because you had no schema to enforce consistency.

The extension system produced PostGIS, which turns PostgreSQL into a geographic information system. You can store latitude and longitude coordinates, calculate distances, find all restaurants within two kilometers of a point, compute the intersection of geographic regions, all inside SQL queries that join against your regular tables. PostGIS is used by government agencies, mapping companies, logistics firms, and anyone else who needs to ask spatial questions about their data.

And then came pgvector. In the twenty-twenties, as everyone started building applications that use large language models and need to find semantically similar text, the database world was suddenly full of specialized vector databases. Pinecone. Weaviate. Qdrant. Each one a separate system you had to deploy, maintain, connect to, and pay for. pgvector is a PostgreSQL extension that adds vector similarity search. You store your embeddings in a PostgreSQL column, build an index on them, and run nearest-neighbor queries joined against your regular relational tables. In a single SQL statement. No separate system. No additional operational burden. Just PostgreSQL doing one more thing that used to require a specialized tool.
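
The operation a vector index accelerates is simple to state: find the stored embeddings closest to a query embedding under some distance metric. This pure-Python sketch uses toy three-dimensional vectors and invented document names; real embeddings have hundreds or thousands of dimensions, and pgvector adds an index so the search does not scan every row.

```python
import math

# Toy "embeddings": semantically similar items point in similar directions.
docs = {
    "cats":    [0.9, 0.1, 0.0],
    "kittens": [0.8, 0.2, 0.0],
    "stocks":  [0.0, 0.1, 0.9],
}

def cosine_distance(a, b):
    # 1 minus cosine similarity: 0 for identical directions, up to 2 for opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

query = [0.88, 0.12, 0.0]
nearest = min(docs, key=lambda name: cosine_distance(query, docs[name]))
print(nearest)  # cats
```

That min() call is the whole trick. pgvector's contribution is letting you express it as a SQL ORDER BY over a distance operator, joined against your regular tables, with an index underneath.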

This pattern has a name in the PostgreSQL community. They call it "Postgres for everything." The argument is simple. Why deploy MongoDB when PostgreSQL has JSONB? Why deploy Elasticsearch when PostgreSQL has full-text search? Why deploy a dedicated vector database when PostgreSQL has pgvector? Why deploy Redis for caching when PostgreSQL has UNLOGGED tables and advisory locks? Every specialized database you add to your stack is another system to deploy, monitor, back up, secure, and debug at three in the morning when something goes wrong. PostgreSQL is already running. It is already backed up. It is already monitored. Just use PostgreSQL.

The relational model has won. It won twenty years ago. Everything else is a niche. The question is not whether to use a relational database. The question is whether you need anything else in addition to it.

There is a beautiful irony in the mascot. Stonebraker himself, across decades of public talks and published papers, has referred to Oracle and IBM as "the elephants," using the word as a pejorative for large, slow, bureaucratic incumbents that resist innovation. Meanwhile, the database that grew directly from his own research at Berkeley chose an elephant as its mascot. An elephant named Slonik, the Russian diminutive for "little elephant," proposed in a bar in Saint Petersburg by early Russian internet pioneers who were among PostgreSQL's first international community members. Stonebraker calls the incumbents elephants. His own creation is an elephant. The little elephant that could do everything the big elephants charged millions for, and did it for free.

And no single company owns it. The PostgreSQL Global Development Group has no CEO, no shareholders, no venture capital investors who need an exit. Compare this to MySQL's journey from Monty's garage to Sun Microsystems to Oracle's portfolio. PostgreSQL cannot be acquired because there is nothing to acquire. The code is open. The community is distributed. The decisions are made by consensus among people who actually maintain the software. This structure is slow. It is sometimes frustrating. It is also the reason PostgreSQL is still here, still independent, still growing, thirty years after a couple of graduate students added SQL to a Berkeley research project.

The Great NoSQL Delusion

The year is two thousand and four. It is the holiday shopping season. Amazon's website is one of the most visited pages on the internet, and it is struggling. Not crashing, exactly. But the relational database infrastructure underneath it is hitting scaling limits that no amount of hardware can solve. The problem is not the queries. The problem is the guarantees.

A relational database promises ACID. Atomicity, consistency, isolation, durability. Every transaction completes fully or not at all. Every transaction moves the database from one valid state to another. No two transactions interfere with each other. Once a transaction commits, its data survives a crash. These guarantees are the reason banks trust relational databases. They are also expensive. Maintaining consistency across a distributed system requires coordination. Coordination requires communication. Communication takes time. Time is latency. Latency, on a shopping website during the holidays, is lost revenue.

Amazon's engineers made a choice. They built a system called Dynamo that deliberately gave up consistency in exchange for availability. If two copies of your shopping cart disagreed about what was in them, Dynamo would keep both versions and let the application figure it out later. This sounds insane if you are used to relational databases. It sounded insane to a lot of people at the time. But it meant that Amazon's shopping cart was always available, even when parts of the network were having problems. You could always add items. You might occasionally see a slightly stale cart. But the page never said "cannot connect to database server." Richard Hipp would have appreciated the priority.

Two years later, in two thousand and six, Google published a paper about Bigtable, a distributed storage system designed to handle petabytes of data across thousands of commodity machines. Bigtable was not relational. It did not use SQL. It stored data in a sparse, distributed, multi-dimensional sorted map, which is a data structure that sounds like something a graduate student invented on a whiteboard and which Google was running at planetary scale. The paper appeared at the USENIX symposium on operating systems design and implementation. The authors included Fay Chang, Jeffrey Dean, and Sanjay Ghemawat, names that appear on a remarkable number of the papers that defined how modern infrastructure works.

These papers were public. Anyone could read them.

Does that sound familiar? It should. Because this is a pattern that keeps playing out in the history of databases. IBM published System R's research and Ellison built Oracle. Google and Amazon published their research and a generation of open source developers built everything else. The pattern is always the same. A large company solves a hard problem internally, publishes the solution because academic prestige matters to the engineers involved, and then watches in mild astonishment as the rest of the world takes their ideas and builds competing products with them.

Now we need to talk about the naming. Because the naming is the funniest part.

In two thousand and nine, a developer named Johan Oskarsson was working at Last.fm in London. He had been following the new wave of non-relational databases, Bigtable-inspired systems, Dynamo-inspired systems, various experiments in storing data without SQL, and he wanted to organize a meetup in San Francisco to discuss them. He needed a name. Specifically, he needed a hashtag, because this was two thousand and nine and Twitter was how developers organized everything.

He went on IRC and asked for suggestions. A developer named Eric Evans, who is a different Eric Evans from the one who wrote the famous book about domain-driven design, which is important because the naming confusion adds a layer of comedy to an already comedic story, threw out the term "NoSQL" in approximately forty-five seconds of thought. It was not meant to name a movement. It was not meant to name a category. It was meant to name a meetup. A hashtag for a Saturday gathering. And then it stuck, the way a nickname given at a party sticks to a person for the rest of their life, long after everyone has forgotten why.

The problem with the name was that it sounded like a manifesto. "NoSQL" implied that SQL was the problem and that these new databases were the solution. This was not what most of the people building these systems believed. They believed they were solving specific scaling problems that relational databases handled awkwardly. But the name suggested a revolution, and revolutions are exciting, and excitement attracts venture capital, and venture capital attracts marketing, and marketing has never met a nuance it could not flatten into a slogan.

MongoDB was the poster child. Founded in two thousand and seven as a company called 10gen by Dwight Merriman, who had previously been the chief technology officer at DoubleClick, the advertising company. Merriman and his co-founder Kevin Ryan originally set out to build a platform-as-a-service, a cloud hosting product. The platform failed. But the database they had built internally to power the platform turned out to be more interesting than the platform itself. They open-sourced the database in February of two thousand and nine. In August of two thousand and thirteen, the company renamed itself MongoDB Incorporated, because by that point the database was the company.

MongoDB stored data as JSON-like documents. No schema required. You could throw any JSON object into a collection and MongoDB would accept it. This was marketed as flexibility. No more writing migration scripts. No more altering tables. No more arguing with your ORM about column types. Just store the data. The developer experience was genuinely pleasant. It felt faster. It felt modern. It felt like progress.

And then, in September of two thousand and ten, someone made a video.

The video was made with Xtranormal, a website that let you type dialogue and have it performed by animated robot characters. The robots had flat, monotone voices and limited gestures, which made them perfect for deadpan comedy. Two robots sit across from each other. One is a MongoDB enthusiast. The other is a skeptic. The conversation goes approximately like this. The enthusiast explains that MongoDB is web scale. The skeptic asks what happens when you need to join data from two collections. The enthusiast says MongoDB is web scale. The skeptic asks about transactions. MongoDB is web scale. Data consistency? Web scale. What if you lose data? You just turn it off and on again. But it is web scale.

The video is called "MongoDB is Web Scale." It won a Webby Award in two thousand and eleven in the Viral category. It is one of the funniest things the tech industry has ever produced, and it is funny specifically because it is only slightly exaggerated. The actual marketing from MongoDB at the time really did emphasize scale above all other considerations. The actual conversations developers were having really did feature people dismissing concerns about data consistency with appeals to scalability. The video captured a moment of collective delusion with the precision of a documentary and the tone of a cartoon.

But the delusion had teeth. And the teeth showed up in production.

Before version two point six point zero, released in two thousand and fourteen, MongoDB accepted unauthenticated remote connections by default. Out of the box. No username. No password. No authentication of any kind. If you installed MongoDB, started it, and connected it to the internet, anyone in the world could connect to your database and read, modify, or delete everything in it. You did not even need to exploit a vulnerability. You just connected. The front door was open because there was no door.

By two thousand and fifteen, security researcher John Matherly, the creator of Shodan, a search engine that indexes internet-connected devices, reported over thirty thousand publicly exposed MongoDB instances. By the time the mass exploitation campaigns started in earnest, the number had grown to over two hundred thousand. Attackers would connect, copy or delete the data, and leave a ransom note in the now-empty database. The note would explain that if you paid a certain amount in Bitcoin, you might get your data back. Many victims paid. Most did not get their data back.

The deeper problem with NoSQL was not security. Security can be fixed with a configuration change. The deeper problem was theoretical, and a computer scientist named Eric Brewer had already explained it.

In two thousand, at the ACM symposium on principles of distributed computing, Brewer presented what became known as the CAP theorem. The claim was that a distributed data system can provide at most two of three guarantees. Consistency, meaning every read returns the most recent write. Availability, meaning every request gets a response. Partition tolerance, meaning the system keeps working even when network messages between nodes are lost or delayed. Since network partitions are inevitable in any real distributed system, you are really choosing between consistency and availability. You can have a system that is always consistent but sometimes unavailable during a partition, which is what relational databases do. Or you can have a system that is always available but sometimes inconsistent during a partition, which is what Dynamo did.

Seth Gilbert and Nancy Lynch formally proved Brewer's conjecture in two thousand and two, turning it from a keynote claim into a mathematical theorem. The NoSQL movement largely chose availability. Eventual consistency. Your data will become consistent eventually. Probably. In most cases. Unless it does not. This was presented as a tradeoff. What it turned out to be, in practice, was a source of bugs that were extraordinarily difficult to diagnose because they only appeared under specific timing conditions that were almost impossible to reproduce in testing.

In two thousand and ten, Daniel Abadi published the PACELC theorem, which pointed out something the CAP theorem had obscured. Network partitions are rare. They happen, but most of the time your distributed system is operating normally. During normal operation, the CAP theorem has nothing to say. The real tradeoff during normal operation is between latency and consistency. If you want low latency, you can skip the coordination step and return slightly stale data. If you want consistency, you coordinate, and that takes time. Most NoSQL databases were making the latency-consistency tradeoff during normal operation and calling it "eventual consistency" as if the partition scenario were the reason, when in reality they were just choosing speed over correctness all the time, not just during the rare moments when the network was degraded.
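
The stale-read failure mode is easy to model. This toy sketch (all names invented, no real replication protocol implied) acknowledges a write before the replica has applied it, which is exactly the latency-for-consistency trade: a fast, uncoordinated read from the replica can miss data that the primary has already confirmed.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []  # replication log entries not yet applied

    def enqueue(self, key, value):
        # The primary ships the write asynchronously; nothing is applied yet.
        self.pending.append((key, value))

    def apply_pending(self):
        # "Eventually," the replica catches up.
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

primary = {}
replica = Replica()

# Write to the primary. The client is acknowledged immediately;
# replication happens in the background.
primary["cart"] = ["book", "lamp"]
replica.enqueue("cart", ["book", "lamp"])

stale = replica.data.get("cart")  # fast, uncoordinated read: misses the write
replica.apply_pending()           # replication catches up
fresh = replica.data.get("cart")  # now consistent

print(stale, fresh)  # None ['book', 'lamp']
```

The bug class this produces in real systems is nastier than the sketch suggests, because whether a given read lands before or after apply_pending depends on timing you do not control and cannot reliably reproduce in a test.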

The practical consequences accumulated. Schema-on-read meant that the application was responsible for validating data, and applications are written by humans who forget edge cases. Data corruption became a feature of the architecture. The absence of joins forced developers to denormalize their data, storing the same information in multiple places, and then to perform joins in application code, which was slower, harder to maintain, and more error-prone than letting the database do it. The absence of transactions meant that operations that needed to update multiple records atomically, like transferring money between accounts, required elaborate application-level coordination that amounted to reimplementing transactions badly.
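
Here is what "joins in application code" actually looks like: a hand-rolled hash join the developer must write, test, and maintain, for something a relational database does in one declarative statement. The collections and field names are invented for illustration.

```python
# Two "collections," denormalized the way a document store encourages.
users = [{"_id": 1, "name": "Ada"}, {"_id": 2, "name": "Grace"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "lamp"},
    {"user_id": 2, "item": "pen"},
]

# A hash join, by hand: index one side by key, then probe with the other.
# The database would do this (and pick the strategy) from a single
# "SELECT ... JOIN ... ON users.id = orders.user_id".
by_id = {u["_id"]: u for u in users}
joined = [(by_id[o["user_id"]]["name"], o["item"]) for o in orders]
print(joined)  # [('Ada', 'book'), ('Ada', 'lamp'), ('Grace', 'pen')]
```

Every edge case the database's join implementation already handles, missing keys, memory limits, choosing a better algorithm when the data grows, becomes this code's problem instead.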

The trajectory was predictable. A generation of developers built systems in MongoDB because they were told it was faster and simpler. They discovered, gradually, that the relational constraints they had thrown away were actually doing useful work. ACID was not a limitation. It was a service. Schemas were not bureaucracy. They were documentation that the database enforced automatically. Joins were not slow. They were correct. And one by one, project by project, many of those developers migrated back to PostgreSQL, which by this point had JSONB and could store their schema-less documents anyway, inside a real relational database, with transactions and joins and all the other things they had briefly decided they did not need.

We used MongoDB because someone at a conference said Postgres couldn't handle our scale. Our scale turned out to be twelve thousand rows. Postgres would have been fine. Postgres would have been more than fine.

Not everyone came back. Cassandra, originally built at Facebook in two thousand and seven by Avinash Lakshman, who had co-authored the Dynamo paper, found a legitimate niche in workloads that genuinely needed massive write throughput across multiple data centers. Facebook itself later replaced Cassandra with HBase for the feature it was originally built for, inbox search, which is either an indictment of Cassandra or evidence that Facebook's needs changed faster than any database could track. CouchDB, built by Damien Katz starting in two thousand and five, pioneered offline-first replication for mobile applications, a genuinely useful capability that relational databases handled poorly. Each database found its niche. None of them replaced SQL.

The Redis Soap Opera

Salvatore Sanfilippo was born on March seventh, nineteen seventy-seven, in Sicily. His online alias was antirez. He was building an Italian startup called LLOOGG, a real-time web log analyzer. Think of it as a precursor to Google Analytics, but for the era when you could still build a web analytics product from your apartment and have a reasonable chance of attracting users. LLOOGG showed you, in real time, who was visiting your website, where they came from, and what they were looking at. The "real time" part was the problem. MySQL, which was powering the backend, could not keep up. The queries were too slow. The writes were too frequent. The data was inherently ephemeral, a running window of the most recent activity, and storing it in a relational database designed for permanent records was like using a filing cabinet to hold Post-it notes.

Sanfilippo's insight was almost offensively simple. Hold everything in memory. Do not write to disk unless someone asks. If the data fits in RAM, reading it takes nanoseconds instead of milliseconds. If you lose power, you lose the data, but for a real-time analytics dashboard showing the last ten minutes of activity, losing the data on a power failure is not a catastrophe. It is a minor inconvenience. You start collecting again.

He built a prototype in Tcl in roughly three hundred lines. It worked. He called it LMDB, for LLOOGG Memory DB. Then he rewrote it in C and renamed it. Remote Dictionary Server. Redis.

The first C version was released on February twenty-sixth, two thousand and nine. By June nineteenth of that year, Redis had completely replaced MySQL in LLOOGG's production environment. Four months from initial release to sole database. The speed difference was not incremental. It was categorical. Operations that took milliseconds in MySQL took microseconds in Redis. For LLOOGG's use case, this was the difference between an application that felt sluggish and an application that felt instantaneous.

Redis is, at its core, a data structure server. Not a database in the traditional sense. It holds strings, lists, sets, sorted sets, hashes, bitmaps, streams, and various other data structures in memory, and lets you perform operations on them with sub-millisecond latency. You can use it as a cache, putting frequently accessed data in Redis to avoid hitting your main database. You can use it as a message broker, with publishers and subscribers exchanging messages through Redis channels. You can use it as a job queue, with workers pulling tasks from a Redis list. You can use it as a session store, keeping user session data in memory for fast access. It is a Swiss Army knife for problems that need fast, temporary, or semi-permanent data access.
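The "data structure server" idea is easier to see in code than in prose. The sketch below is not Redis, just a toy in-memory store that mimics the shape of two of its most common patterns: string keys as a cache, and lists as a job queue. All names here are invented for the demonstration.

```python
from collections import deque

class MiniStore:
    """A toy in-memory data structure server, illustrating the Redis idea:
    everything lives in RAM, and the 'database' is really named data
    structures you operate on directly. Not Redis itself, just the shape."""

    def __init__(self):
        self.strings = {}   # SET/GET: plain key-value cache
        self.lists = {}     # LPUSH/RPOP: lists usable as job queues

    def set(self, key, value):
        self.strings[key] = value

    def get(self, key):
        return self.strings.get(key)

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        q = self.lists.get(key)
        return q.pop() if q else None

store = MiniStore()

# Cache pattern: keep a hot value in memory to avoid a slower lookup.
store.set("session:42", "user=par")

# Queue pattern: producers push jobs on one end, workers pop the other,
# so jobs come out oldest-first.
store.lpush("jobs", "resize photo 1")
store.lpush("jobs", "resize photo 2")
first_job = store.rpop("jobs")
```

Every operation is a dictionary or deque manipulation in RAM, which is why this class of system answers in microseconds: there is no disk in the hot path at all.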

Sanfilippo maintained Redis for eleven years as its benevolent dictator for life. His relationship with the project was unusual in open source. He was not just the maintainer. He was the voice, the personality, the person who wrote thoughtful blog posts about data structures and responded to issues on GitHub with the patience and clarity of someone who genuinely enjoyed explaining things. The sponsorship chain went VMware from twenty ten to twenty thirteen, then Pivotal from twenty thirteen to twenty fifteen, then Redis Labs, later renamed Redis Limited, from twenty fifteen onward. Each transition moved Redis slightly further from its indie origins and slightly closer to a commercial product.

On June thirtieth, twenty twenty, Sanfilippo published a blog post titled "The end of the Redis adventure." He was stepping down as maintainer. His words were characteristically honest. He had spent eleven years on Redis. He wanted to write code. Not manage a project. Not review pull requests. Not mediate community disputes. Not be the person everyone emails when they have a feature request or a complaint or a question about licensing. He wanted to write software for the pleasure of writing it, without the obligation of maintaining it for the world.

I write code in order to express myself, and I consider what I code an artifact, rather than just something useful to get things done. This is the main mass of my mass-energy equivalence, and this is what I want to do with my mass.

The project continued without him. Redis Limited employed a team of developers who maintained the codebase. The community contributed. Everything was fine.

Until March of twenty twenty-four, when Redis Limited changed the license.

Redis had been BSD-licensed since its creation. The BSD license is one of the most permissive open source licenses in existence. You can do essentially anything with BSD-licensed software, including incorporating it into commercial products without contributing anything back. This was fine for fifteen years. Then it became a problem, or at least Redis Limited decided it was a problem, because Amazon Web Services, Google Cloud, and Microsoft Azure were all offering managed Redis services. They were taking Redis, running it on their infrastructure, selling access to it, and not paying Redis Limited anything. Legally, this was permitted by the BSD license. Commercially, it was eating Redis Limited's lunch.

Redis Limited switched the license to a combination of RSAL, the Redis Source Available License, and SSPL, the Server Side Public License. Neither of these is an open source license as defined by the Open Source Initiative. The practical effect was that cloud providers could no longer offer managed Redis services without a commercial agreement with Redis Limited. Amazon, Google, Oracle, and others would need to pay up or stop offering the service.

Within weeks, the response was devastating. Amazon, Google, Oracle, Ericsson, and the Linux Foundation forked the last BSD-licensed version of Redis, version seven point two point four, and created a new project called Valkey. The fork was immediate, well-organized, and backed by companies with more engineering resources than Redis Limited could ever match. Before the license change, twelve contributors who did not work for Redis Limited were responsible for fifty-four percent of all commits to the project. After the license change, the number of non-employee contributors with more than five commits dropped to zero. The community did not just leave. The community took the code and built a new house with it.

Sanfilippo returned to Redis Limited in November of twenty twenty-four. Redis eight point zero, released in May of twenty twenty-five, switched to a tri-license model that added AGPL as a third option, a partial correction acknowledging that the original licensing move had been too aggressive. Whether the community returns or stays with Valkey remains an open question. Forks, once established, are hard to undo. Ask Monty Widenius about MariaDB.

DuckDB: The New Kid

In Amsterdam, at the Centrum Wiskunde and Informatica, the Dutch national research institute for mathematics and computer science, two researchers named Mark Raasveldt and Hannes Mühleisen had a problem that was not a problem anyone else seemed to care about.

The R community loved data analysis. They lived in it. Downloading datasets, running statistical models, building visualizations, exploring patterns. But the moment a dataset got too large to fit comfortably in R's memory, the workflow broke. You needed a database. And databases were annoying. They required installation, configuration, server processes, connection strings, user management. Setting up PostgreSQL to analyze a CSV file felt like buying a shipping container to store a bicycle.

SQLite was embedded and required no setup, which was good. But SQLite was designed for OLTP, online transaction processing. It was optimized for the kind of workload a phone app generates. Insert a row. Read a row. Update a row. Small, frequent operations on individual records. Data analysis is the opposite of this. Data analysis is OLAP, online analytical processing. Scan an entire column. Compute an aggregate across millions of rows. Group by this, filter by that, join with something else, and give me the result. These are fundamentally different access patterns, and a database optimized for one will be mediocre at the other.
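The two access patterns can be put side by side in a few lines. This sketch uses an in-memory SQLite database, with an invented table, purely to show the contrast: an OLTP operation touches one row by key, an OLAP operation scans a whole column.

```python
import sqlite3

# OLTP vs OLAP, illustrated on an in-memory SQLite database.
# The events table and its data are invented for this demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user TEXT, amount REAL)")
con.executemany(
    "INSERT INTO events (user, amount) VALUES (?, ?)",
    [("par", float(i)) for i in range(1000)],
)

# OLTP-style: insert one row, then read one row back by its key.
con.execute("INSERT INTO events (user, amount) VALUES (?, ?)", ("guest", 5.0))
row = con.execute("SELECT user FROM events WHERE id = 1").fetchone()

# OLAP-style: scan the entire amount column and aggregate it.
total, n = con.execute("SELECT SUM(amount), COUNT(*) FROM events").fetchone()
```

A row-store lays data out so the first query is cheap; a column-store lays it out so the second one is. That layout decision is the whole reason DuckDB exists alongside SQLite rather than replacing it.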

Raasveldt and Muhleisen worked in the database architectures research group at CWI. The same research institute where Guido van Rossum created Python. The same research institute where MonetDB, one of the pioneering column-store databases, was developed. They tried hacking MonetDB to work as an embedded analytical database. It did not go well. MonetDB was a server-based system with assumptions baked into its architecture that did not translate to the in-process model.

So they started over. Evenings. Weekends. Just hacking on something new for a couple of years. The result was DuckDB. First open-source version in twenty nineteen. DuckDB Labs spun off as a company in the summer of twenty twenty-one.

The positioning is elegant. SQLite for analytics. Everything that makes SQLite great for embedded transactional workloads (no server, no configuration, a single file, runs in your process), DuckDB provides for embedded analytical workloads. You can open a CSV file, a Parquet file, a JSON file, and query it with SQL immediately. No import step. No schema definition. No loading phase. Just point DuckDB at the data and ask questions.

The company has declined all venture capital offers. Two million downloads per month. The category "in-process analytical database" did not exist before DuckDB. It exists now.

Your Databases, Par

We need to talk about Par Boman's databases now.

I say "databases" in the plural because there are eighteen of them. Eighteen. Running in production. Right now. Not on some vast corporate infrastructure with a team of database administrators and a budget for monitoring tools. On a single Scaleway VPS in a data center in Paris, and on his laptop, and scattered across his projects like a man who buys notebooks at every stationery shop he passes and then forgets which one has the shopping list.

Let us count them. Let us count all eighteen of Par Boman's databases.

Eight PostgreSQL databases on the VPS. The first is called "parkit." This is a shared database used by nine microservices. Nine. Par has nine microservices. They are called things like Capture, Focus, Time, and Stats. They are, collectively, a personal productivity system that he built himself. In the year twenty twenty-five, a man who could have downloaded Todoist decided to build nine microservices instead. And they all share one PostgreSQL database, because Par read somewhere that microservices should share nothing, thought about it for approximately fifteen seconds, and decided he disagreed.

If they all need the same data, why would I run nine databases? That's just nine things that can break instead of one.

He has a point. It is not a popular point. It contradicts a decade of distributed systems advice. But he has a point.

The second PostgreSQL database is called "live." This powers a multi-site news dashboard. Weather, police reports, events. It is called Arebladet Live, and it is part of his work on a local newspaper in northern Sweden. The third is "ttpanotis," a political advertising transparency tool that tracks political ads. This one uses SQLAlchemy and Alembic, which means it is the one project where Par let an ORM handle the schema migrations instead of writing them by hand. The fourth is "partypar," a party equipment rental catalog. Par rents out party equipment. The fifth is "parcel," a print-on-demand storefront backed by Printful. Par sells things. The sixth is "parpixel," a photographer portfolio and digital goods shop. Par takes photographs. The seventh is "listmonk," a newsletter service. Par sends newsletters. The eighth is planned but not yet deployed, called "archive," intended to hold one thousand nine hundred conversations with AI systems, searchable with pgvector semantic embeddings.

That is eight PostgreSQL databases on one VPS. One man. Eight databases. Seven of them running with hand-written SQL and numbered migration files that execute on startup. No ORM. No migration framework. No database administrator. Just a Swedish developer typing CREATE TABLE and ALTER TABLE ADD COLUMN into files named zero zero one underscore initial dot sql, zero zero two underscore add underscore timestamps dot sql, and so on, exactly the way Edgar Codd would have wanted, which is to say, with the schema explicitly declared and the data independent of the application code.
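The numbered-migration pattern described above is simple enough to sketch in full. This is not Par's actual code, just a minimal illustration of the idea, using SQLite and invented file names: SQL files applied in sorted order on startup, with a bookkeeping table remembering which ones have already run.

```python
import pathlib
import sqlite3
import tempfile

# Write two numbered migration files, the way the pattern expects them
# on disk. File names and schema are invented for this sketch.
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "001_initial.sql").write_text(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT);"
)
(workdir / "002_add_timestamps.sql").write_text(
    "ALTER TABLE notes ADD COLUMN created_at TEXT;"
)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")

def migrate(con, directory):
    """Apply every .sql file, in sorted (numbered) order, exactly once."""
    applied = {r[0] for r in con.execute("SELECT name FROM schema_migrations")}
    for path in sorted(directory.glob("*.sql")):
        if path.name not in applied:
            con.executescript(path.read_text())
            con.execute(
                "INSERT INTO schema_migrations (name) VALUES (?)", (path.name,)
            )

migrate(con, workdir)
migrate(con, workdir)  # second startup: nothing to do, nothing breaks

columns = [r[1] for r in con.execute("PRAGMA table_info(notes)")]
```

The whole framework is one table and one loop, which is the argument for the approach: the schema history is plain SQL in version control, readable by anyone, owned by nothing.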

Now the SQLite databases. There are eight of these. They are everywhere.

"koma dot db" is a gut health tracker for his partner. It uses better-sqlite3 in Node. Par built a medical tracking application for the person he lives with. This is either very romantic or a sign that Par relates to people through databases. Possibly both.

"stats dot db" is a web analytics system. Privacy-first. IP addresses are hashed. Par built his own analytics system rather than use an existing one. His analytics system has its own database. This is not unusual for Par.

"jobs dot db" is a background job queue for a photo booth AI. Par has a photo booth AI. It transforms photos using AI models. The jobs waiting to be processed sit in a SQLite database.

"focus dot db" is a local development copy of one of his PärKit microservices. This means Par has the same data in both PostgreSQL on the VPS and SQLite on his laptop, for development purposes. Two databases doing one job. We are at twelve and climbing.

"inventory dot db" tracks electronic components. Par builds electronics projects. He has enough components to need a database to track them. He has a database to track his resistors.

"storyteller dot db" is for an AI creative writing desktop application. Par built a creative writing application and gave it a database. "napkincast dot db" is for a podcast purchase service that is not even deployed yet. It does not exist as a running service. It has never served a single customer. It has never processed a single transaction. But it already has a database, because Par does not build applications. Par builds databases and then wraps applications around them.

Then there is DuckDB. One of them. Inside something called LifeLab version three, which Par describes as an "investigation tool." It contains twenty-seven thousand four hundred and forty-four photos and uses an embedded column store for analytical queries across the photo metadata. We are not going to ask what Par is investigating with twenty-seven thousand photos. We are going to respect the boundary between curiosity and privacy and move on.

And finally, Redis. One instance. Running as a job queue for the photo booth AI transformations. No persistent data. Redis being used as a temporary holding area for work that needs doing. This is Redis as its creator intended. Sanfilippo would approve.

I know it sounds like a lot. But each one has a reason.

They do all have reasons. That is not the point. The point is that Par Boman, a single developer working on personal projects and a small newspaper, has accumulated more database instances than most startups that have raised a Series A. He has more databases than some companies that employ database administrators. He has PostgreSQL for the things that need to be shared, SQLite for the things that live on one machine, DuckDB for the one time he needed analytics, and Redis for the one time he needed a queue. This is, accidentally, a near-perfect architecture. It follows every best practice that the database community has spent fifty years converging on. Use an embedded database for single-application use. Use a server database for shared access. Use a specialized engine for specialized workloads. Do not use more technology than the problem requires.

And here is the beautiful irony. The beautiful, full-circle, fifty-year irony that ties this entire two-part series together.

Par runs raw SQL everywhere. No ORM in most of his projects. Hand-written schema migrations in numbered files. He writes CREATE TABLE. He writes SELECT. He writes JOIN. He writes WHERE. He types the actual keywords of the language that Don Chamberlin and Ray Boyce designed at IBM in nineteen seventy-four, the language that Edgar Codd considered a compromised implementation of his mathematical vision, the language that Larry Ellison shipped as Version Two because Version One sounded too risky.

Par types these keywords and the data comes back and he does not think about what is happening underneath. He does not think about query planning or B-tree indexes or write-ahead logging or multiversion concurrency control. He does not need to. That is the entire point. That is what Codd was trying to achieve in those thirteen pages. Data independence. You describe what you want. The machine figures out where it lives and how to get it. Par is living in the future that Codd designed, using it so naturally that it feels like electricity, invisible infrastructure that has always been there and always worked.

And it cost him exactly zero dollars in Oracle licensing fees. Zero dollars in database support contracts. Zero dollars in compliance audit settlements. He has eighteen databases, and the total annual cost of the database software is the same amount that Edgar Codd charged for his thirteen-page paper. Nothing. The open source ecosystem that grew from Berkeley and spread through a Swedish programmer naming things after his daughters has given Par, and millions of developers like him, the full power of Codd's relational model, for free, maintained by communities that cannot be acquired, running on software that cannot be taken away.

Oracle would like a word. Oracle always wants a word. Oracle would like to explain the different components of value and how customers have no choice but to pay. But Par has a choice. And Par chose PostgreSQL and SQLite and a small database written by a man on a warship who was tired of dialog boxes that said "cannot connect to database server."

What You Actually Need to Know

We have covered a lot of ground across these two episodes. Filing cabinets to warships. Mathematicians to license auditors. Let us extract the practical wisdom from all of this history, because history is interesting but not useful unless it changes how you act.

If your data fits on one machine, and almost everyone's data fits on one machine, you probably do not need distributed anything. The NoSQL movement was built on the scaling challenges of Google and Amazon. Google processes billions of searches per day. Amazon handles millions of transactions during peak shopping events. You are not Google. You are not Amazon. You are Par Boman with eighteen databases on a single VPS in Paris and the whole thing runs fine. The machine is not even working hard. The vast majority of applications will never hit the scaling limits that motivated Bigtable and Dynamo, and the complexity cost of pretending you will hit those limits is enormous.

SQLite is almost certainly the right choice for your side project. If your application runs on one machine and does not need concurrent access from multiple processes, SQLite is faster to set up, simpler to maintain, easier to back up, and more reliable than any server-based database. The backup procedure is "copy the file." The migration procedure is "run the SQL." The deployment procedure is "the database is already there because it is part of the application." Richard Hipp built it to work on a warship. It can handle your to-do app.

PostgreSQL is almost certainly the right choice for your server. If you need multiple applications to access the same data, or concurrent users, or replication, or any of the features that a real server database provides, PostgreSQL has been the answer for a decade and the gap between PostgreSQL and everything else is widening, not narrowing. JSONB handles document workloads. pgvector handles similarity search. PostGIS handles geographic queries. Full-text search handles search. The extension system handles whatever nobody has thought of yet. One database. Already backed up. Already monitored. Already understood by your team.

If someone tells you that you need MongoDB, ask why PostgreSQL's JSONB will not work. There are legitimate answers to this question. MongoDB's sharding and replication model is mature and well-understood for specific high-throughput document workloads. MongoDB Atlas provides a managed service with global distribution. But "we want to store JSON" is not a legitimate answer, because PostgreSQL stores JSON. "We do not want to write a schema" is not a legitimate answer either, because you are going to need to validate your data somewhere, and the database is better at it than your application code.
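The "we want to store JSON" point is concrete enough to demonstrate. PostgreSQL does this with JSONB; the same idea is shown below with SQLite's built-in JSON functions, so the example is self-contained, assuming a Python build whose bundled SQLite includes them (the default for years now). The table and document are invented for the demo.

```python
import sqlite3

# A relational table holding a JSON document in an ordinary column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, doc TEXT)")
con.execute(
    "INSERT INTO profiles (doc) VALUES (?)",
    ('{"name": "par", "dbs": 18}',),
)

# Query inside the document with a path expression. No ORM, no schema
# change, no separate document database.
name, dbs = con.execute(
    "SELECT json_extract(doc, '$.name'), json_extract(doc, '$.dbs')"
    " FROM profiles"
).fetchone()
```

PostgreSQL's JSONB goes further (binary storage, indexing on paths, containment operators), but the basic capability, relational tables with queryable documents inside them, has been table stakes for a decade.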

ORMs are fine until they are not, and raw SQL is fine always. An ORM buys you convenience and portability at the cost of control and visibility. When the ORM generates the wrong query, and it will, you need to understand enough SQL to read the generated query, understand why it is wrong, and fix it. If you already know enough SQL to do that, you already know enough SQL to skip the ORM. Par writes raw SQL and numbered migration files and his databases work and his schemas are clear and his queries do what he expects. This is not a sophisticated approach. It is the approach that works.

Back up your databases. Par does this. He has a daily backup script that snapshots his PostgreSQL databases and copies the SQLite files. He stores them off-site. He has tested the restore process. This is worth mentioning because it is the single most important thing you can do with a database and it is the thing most developers do last. The database is your application's memory. Everything else can be rebuilt from code. The data cannot be rebuilt from anything. If you lose it, it is gone. Your users' data, your business records, your analytics history, the gut health tracking data for your partner. Gone.

The database is the last thing you should optimize and the first thing you should understand. Most performance problems are not database problems. They are application problems. Bad queries, missing indexes, unnecessary joins, loading more data than you need. Understanding what the database is doing, at a conceptual level, is the highest leverage skill you can develop as a developer. You do not need to understand B-tree balancing algorithms. You need to understand that a query that scans every row is slower than a query that uses an index, and that adding an index makes writes slower to make reads faster, and that these are tradeoffs you should make consciously rather than accidentally.

If you don't know what your query is doing, you don't know what your application is doing. And if you don't know what your application is doing, you shouldn't be shipping it.

Conclusion: The Mathematician and the Filing Cabinet

Edgar Frank Codd died on April eighteenth, two thousand and three, in Williams Island, Florida. He was seventy-nine. He had been a mathematician, a pilot, an IBM researcher, and the inventor of the relational model. He spent his last years frustrated that the databases bearing the word "relational" did not truly implement his mathematical vision. SQL allowed duplicate rows. NULL handling was a mess. The purity of set theory had been compromised for practical reasons that he considered unnecessary.

He was right about the compromises. He was also living proof that being right does not mean the world listens. The databases he inspired did not follow his blueprint exactly. They followed it approximately. And the approximate version turned out to be good enough to run the world.

Every database Par has ever used traces its lineage directly to those thirteen pages published in nineteen seventy. The mathematical precision of a British RAF navigator who saw the future in set theory. The filing cabinet problem that Codd solved was not just a nineteen sixties inconvenience. It was a fundamental question about the relationship between data and the programs that use it. Should programs know where data lives? Codd said no. Programs should describe what they want. The machine should figure out the rest.

Fifty-six years later, Par types SELECT and the data comes back. He does not think about where it lives. He does not navigate through pointers. He does not follow paths through a hierarchy. He describes what he wants, in a language that is a compromised version of a mathematical formalism designed by a man who hated the compromises, and the machine figures out the rest.

Richard Hipp put a database in a warship and then in every phone on earth. Michael Stonebraker built the same thing twice and improved it both times. Monty Widenius named databases after his daughters and forked his own creation when it was taken from him. Salvatore Sanfilippo held everything in memory because disk was too slow and then walked away because maintenance was too much. Mark Raasveldt and Hannes Mühleisen built a database in Amsterdam because R users deserved better tools.

And at the beginning of it all, in a corporate research lab in San Jose, a former RAF navigator who flew Sunderland flying boats over the Atlantic wrote thirteen pages of mathematics that turned filing cabinets into something that could remember anything, find anything, and never lose anything, as long as someone remembered to back it up.

That is a good pedigree for something you barely think about. That is a great pedigree for eighteen databases on a VPS in Paris, doing exactly what they were designed to do, fifty-six years after a mathematician described them in a language that the industry bent slightly out of shape and then standardized anyway.

Par's databases are fine. All eighteen of them. They are the product of sixty years of argument, compromise, genius, greed, and open source defiance. They work. They will keep working. And every time he types SELECT and the data comes back, the ghost of Edgar Codd almost smiles. Almost. He would smile more if Par's queries handled NULLs correctly.

But that is a story for another time.