It's All Related: The Art of Splitting Things Apart

The Table That Lied

This is episode four of The Vibecoder's Guide to Postgres.

Imagine you are running an online store. You have been vibing with your AI assistant, and it has built you a beautiful little application with a database behind it. Business is going well. Orders are coming in. And then one day, a customer emails you. She moved last month, but the package went to her old address. You check the database, and you find something strange. Her new address is on her latest order, but her old address is still attached to three previous orders. That makes sense, those were shipped to the old place. But when you look at the customer record in your orders table, you realize there is no separate customer record. Her name, her email, and her address are just columns on the orders table. Every single order has its own copy of who she is.

So now you have a question. What is her real address? The one on order number forty-seven? Or order number fifty-one? If you want to send her a marketing email, which row do you trust? If she calls to update her phone number, do you update it on every single order she has ever placed? What if you miss one?

This is not a hypothetical disaster. This is what happens when data that belongs together gets smeared across rows that do not need it. And it is exactly the kind of schema that AI assistants love to generate. Ask a language model to design a database for an online store, and there is a very good chance you will get one big flat table with the customer name and email repeated on every order. It looks clean. It works for the first demo. And it is a ticking time bomb.

Today we are going to talk about why the entire point of a relational database is to split data across multiple tables and then bring it back together when you need it. Foreign keys, joins, normalization. The words sound academic, but the idea is simple and it solves the exact problem we just described. This is also the episode where vibe coding gets genuinely hard, because the thing AI struggles with most in database design is understanding that data has relationships.

The Man Who Saw Tables

[serious] To understand why databases work the way they do, you need to meet the person who invented the idea. His name was Edgar Frank Codd, though everyone called him Ted, and his life story reads like a spy novel with mathematics.

Codd was born in nineteen twenty-three on the Isle of Portland, a windswept limestone peninsula off the coast of Dorset, England. His father manufactured leather. His mother was a schoolteacher. He studied mathematics and chemistry at Exeter College, Oxford, and when the Second World War broke out, he became a pilot in the Royal Air Force Coastal Command, flying Sunderland flying boats on patrol over the Atlantic.

After the war, Codd crossed the Atlantic in a different way. He moved to New York in nineteen forty-eight and joined IBM as a mathematical programmer, working on one of the company's earliest electronic computers. But in nineteen fifty-three, something happened that had nothing to do with computers. [serious] Senator Joseph McCarthy was running his anti-communist hearings, and Codd, a British citizen with strong opinions, left the United States in protest. He moved to Ottawa, Canada, and ran a data processing department there for several years. He only came back to America after a chance meeting with his old IBM manager convinced him to rejoin the company.

By the late nineteen sixties, Codd had earned a PhD from the University of Michigan, studying under John Henry Holland, the father of genetic algorithms. He had transferred to IBM's San Jose Research Laboratory in California. And he was increasingly frustrated by the way databases worked.

In nineteen seventy, every serious database was either hierarchical or networked. Think of a file cabinet where every drawer leads to another drawer, and the only way to find what you want is to know the exact path through the drawers. If you wanted customer data and their orders, you had to navigate a tree structure, following pointers from node to node. If someone reorganized the tree, every program that used the database broke. The physical layout of the data and the logical meaning of the data were tangled together, and Codd thought this was fundamentally wrong.

In June nineteen seventy, Codd published a paper called "A Relational Model of Data for Large Shared Data Banks" in the Communications of the ACM. It was eleven pages long. It used the language of set theory and predicate logic. And it proposed something radical. Instead of navigating trees of pointers, you would store data in simple tables. Rows and columns, like a spreadsheet. The relationships between tables would be defined by shared values, not physical connections. And you would query the data by describing what you wanted, not by specifying how to get it.

Codd had a wonderful dry wit about his work. Years later, when asked about the name "normalization," a concept he introduced in that same paper, he said something perfect.

[calm] At the time, Nixon was normalizing relations with China. I figured that if he could normalize relations, then so could I.

IBM's reaction to this revolutionary paper was, to put it politely, unenthusiastic. The company was making enormous amounts of money selling IMS, its hierarchical database. Codd's relational model was a direct threat to that revenue. IBM sat on the idea for four years before reluctantly funding an experimental project called System R. And even then, they pulled Codd off the project based on his own work, for reasons that have never been fully explained.

It did not matter. Other companies saw the paper and ran with it. Larry Ellison read about System R in an academic journal and built a product around the idea. He called it Oracle. Codd won the Turing Award in nineteen eighty-one. The relational model conquered the world. And the core idea, the one that makes it all work, is what we are talking about today. Relationships between tables.

Why You Split Data Apart

Here is the fundamental insight, and it is so simple it almost seems too obvious once you hear it. If a piece of information describes one thing, it should be stored in one place.

Your customer's name describes the customer. Not the order. Not the invoice. Not the shipping label. The customer. So the customer's name belongs in a customers table, stored once, in one row. When an order needs to know who placed it, it does not store a copy of the customer's name. It stores a reference. A pointer that says "this order belongs to customer number forty-two."

That reference is called a foreign key. And foreign keys are the connective tissue of every relational database.

Think of it like a library. A library does not photocopy the author's biography and staple it inside every book they wrote. The library has a card for the author, and each book has the author's name on it. If the author changes their pen name, you update one card, not three hundred books.

In Postgres, a foreign key is a column in one table that points to the primary key of another table. When you create the connection, Postgres makes you a promise. It will never let an order reference a customer who does not exist. Try to insert an order for customer number nine thousand when there is no customer nine thousand, and Postgres will refuse. Flat out reject the operation. Try to delete a customer who still has orders, and Postgres will refuse that too, unless you have explicitly told it what to do in that case.

This is called referential integrity, and it is one of those boring-sounding things that saves you at three in the morning. Without it, you get orphaned records. Orders pointing to customers who have been deleted. Comments attached to blog posts that no longer exist. A calendar event linked to a task that someone removed last Tuesday. The data looks fine until you actually try to use it, and then you discover that half your references point to ghosts.

The Vibecoder's Foreign Key Problem

Here is where vibe coding enters the story. If you ask an AI assistant to generate a database schema, it will often produce something that looks perfectly reasonable but has no foreign keys at all. Or it creates the columns that should be foreign keys but does not add the constraint. The column is there, it has a name like customer underscore id, but nothing in the database actually enforces the relationship.

Why does this happen? Because language models optimize for what looks correct, not what is correct. A schema without foreign key constraints will work perfectly for your first hundred rows of test data. Every insert succeeds. Every query returns results. The AI has produced code that runs without errors, which is its definition of success. But it has not produced code that protects your data, because protection is about what should not be allowed to happen, and language models are not naturally good at reasoning about absence.

[angry] I spent three weeks debugging why our reports showed orders from deleted accounts. Turned out the AI never added the foreign key constraint. The column was there. The data was there. The constraint was not.

The fix is straightforward, and it is something every vibecoder should learn to check. After your AI generates a schema, look for the foreign key constraints. Are they actually there? Do they have the right behavior for deletes? When a customer gets removed, should their orders cascade delete too, or should Postgres refuse the deletion until the orders are handled first? These are design decisions that the AI cannot make for you, because they depend on what your application actually means.

Bringing It Back Together

So you have split your data apart. Customers in one table, orders in another, products in a third. The data is clean, each fact stored once, no duplication, no ambiguity. But now you have a problem. When you want to display an order, you need the customer's name and the product's description and the order total all at once. The data is in three different tables.

This is where joins come in. A join is an operation that combines rows from two or more tables based on a related column between them. You are telling Postgres to look at two tables, find the rows that match, and give you a combined result. It is the inverse operation of normalization. You split the data apart for storage and bring it together for queries.

The most common join is the inner join. Think of it like a guest list for a party. You have a table of people who were invited, and a table of people who showed up. An inner join gives you only the people who were both invited and actually came. Both tables must have a matching row, or you get nothing. In database terms, if an order references customer forty-two, an inner join between orders and customers will only include that order if customer forty-two actually exists in the customers table. If the customer has been deleted somehow, that order just vanishes from the result. Silently. Which is why foreign keys matter, because they prevent this invisible data loss from being possible in the first place.

The left join is different, and arguably more useful in practice. A left join says "give me everything from the first table, and if there is a match in the second table, include that too. If there is no match, fill in the blanks with nothing." Imagine that same party, but now you want the complete invite list. Everyone who was invited, plus a note about whether they showed up. People who were invited but did not come still appear on your list, they just have a blank in the attendance column.

This is the join you will use most often. Show me all customers and their orders. Customers who have not ordered yet still appear, they just have no order data attached. Show me all products and their reviews. Products with no reviews still show up, you just see empty review columns. The left join is how you ask the question "who is missing?" without losing anyone.

There is also a right join, which is the mirror image. Everything from the second table, with matches from the first. In practice, almost nobody uses right joins because you can always rearrange a right join into a left join by swapping the table order, and left joins are easier to read. Some database professionals go their entire careers without writing a right join.

And then there is the full outer join, which is both sides at once. Give me everything from both tables, matched where possible, with blanks where there is no match. This one is rare. You use it when you genuinely want to see all the data from both tables and identify what does not match up on either side. An audit tool, essentially. Most applications never need it.

Rabbit Hole: The Normalization Ladder

This next section goes deep into the theory of normalization. If you just want to know the practical takeaway, skip ahead to the next chapter. But if you want to understand why Codd's system works the way it does, and why your AI's schemas keep getting it wrong, stick around.

Normalization is the process of organizing your database to reduce redundancy and prevent anomalies. Codd introduced the first normal form in his nineteen seventy paper, and then expanded it to second and third normal form in a follow-up paper in nineteen seventy-one called "Further Normalization of the Data Base Relational Model." A few years later, working with a brilliant young IBM researcher named Raymond Boyce, he refined the rules into what is now called Boyce-Codd Normal Form.

[sad] Raymond Boyce, by the way, deserves his own moment. He co-invented SQL alongside Donald Chamberlin at IBM. They originally called it SEQUEL, for Structured English Query Language, but had to change the name because a British aircraft company called Hawker Siddeley had trademarked the word. Boyce was twenty-seven years old, recently married, with a ten-month-old daughter named Kristin. On June eighteenth, nineteen seventy-four, he collapsed at the IBM research lab from a brain aneurysm and died. He had co-invented the query language the entire world would use, helped define how databases should be structured, and did not live to see any of it deployed. <break time="1s"/>

So what are these normal forms? Think of them as levels of cleanliness for your data, where each level removes a specific type of problem.

First normal form is the simplest. Every cell in your table must hold a single value. No lists, no nested structures, no field that says "running shoes, wool socks, rain jacket" in a single column. If a customer has three phone numbers, you do not cram them into one field separated by commas. You either create a separate phone numbers table, or at minimum, have separate columns for each. The key word is atomic. Each piece of data is indivisible.

Second normal form builds on first. It says that every non-key column must depend on the entire primary key, not just part of it. This one mostly matters when you have a composite key, a primary key made of two or more columns. Imagine a table tracking which students take which courses, with a grade. The primary key is the combination of student and course. The student's name depends only on the student, not on the course. So the student's name should not be in this table. It violates second normal form because it only depends on part of the key. Move it to a students table.

Third normal form goes one step further. No non-key column should depend on another non-key column. If your orders table has a customer ID, a customer name, and a customer city, then customer name and customer city depend on customer ID, not on the order. They are along for the ride. They should be in the customers table, linked by that foreign key. The classic mnemonic is that each column must provide a fact about the key, the whole key, and nothing but the key.

In practice, third normal form is where most working databases land. There are higher forms, fourth, fifth, and Boyce-Codd, but they address increasingly rare edge cases. If you get your schema to third normal form, you have eliminated the vast majority of data redundancy and anomaly problems. That is the practical target.

And here is the thing that matters for vibecoders. When you understand these three levels, you can look at an AI-generated schema and immediately spot the problems. Oh, the customer's email is on the orders table? That is a second or third normal form violation. There should be a separate customers table. The product description is copied onto every order line item? Same problem. You do not need to memorize the formal definitions. You just need the instinct that says "this data belongs somewhere else."

The NoSQL Wars

For about six years, a significant portion of the software industry tried to pretend that everything we just discussed was unnecessary. They called it the NoSQL movement, and it is one of the most fascinating collective mistakes in the history of technology.

The year was two thousand nine, and a few of the largest companies in the world had a genuine problem. Facebook, Google, and Amazon were operating at scales that broke traditional relational databases. Not because relational theory was wrong, but because the existing implementations could not handle the sheer volume. Google built BigTable. Amazon built Dynamo. Facebook built Cassandra. These were specialized tools for specialized problems, and they worked brilliantly for those problems.

But then something happened that always happens in technology. People saw what the giants were doing and assumed it applied to them too. If Google does not use joins, maybe I should not use joins either. If Amazon built a key-value store, maybe I should build on a key-value store too. Never mind that your application has four hundred users and not four hundred million.

MongoDB arrived in two thousand nine, and it hit the developer world like a freight train. The pitch was seductive. No schema. No migrations. No foreign keys. Just throw your data in as JSON documents and go. It was incredibly easy to start with. You did not have to think about table design or normalization or any of the things Ted Codd had spent his career formalizing. You just stored documents.

[laugh] The viral "MongoDB is Web Scale" video, created with the Xtranormal animation tool around twenty ten, captured the absurdity of the hype perfectly. It featured two cartoon characters arguing about databases, with the MongoDB fan shouting the phrase "it's web scale" as the answer to every objection, while the other character tried in vain to discuss actual engineering tradeoffs. The phrase "web scale" itself had been a genuine marketing term that NoSQL evangelists used, and after the video went viral, most of them quietly stopped.

But plenty of real companies were already committed. Digg, the social news site, decided to denormalize their entire database and migrate to a NoSQL solution. The engineering team reasoned that since the data was no longer relational, there was no point using a relational database. What they discovered was that denormalized data bloats. Their dataset exploded to three terabytes, and they needed constant "fixup" jobs running in the background to correct inconsistencies in the duplicated data.

The open-source social network Diaspora launched on MongoDB in twenty ten because social data seemed perfect for a document store. Within eight months, they switched to MySQL. A developer named Sarah Mei wrote a now-famous blog post in twenty thirteen called "Why You Should Never Use MongoDB," drawing on the Diaspora experience.

[serious] Social data is relational. Some of your domain objects are the same type, and they need to be linked to each other. A document database does not naturally express that relationship.

Her core insight was devastating in its simplicity. Almost all real-world data is relational. Your users have orders. Your orders have products. Your products have categories. Your categories have parent categories. The moment you have two things that need to reference each other, you have relational data, and you are fighting the document model instead of working with it.

The Comeback

By the mid twenty-tens, the tide was turning. Company after company discovered that their denormalized, schema-free databases had created exactly the problems that Codd's normalization rules were designed to prevent. Data duplication, inconsistent records, and no way to enforce integrity at the database level.

In twenty sixteen, Uber published a blog post explaining why they had switched from Postgres to MySQL, building a custom "schemaless" layer on top. The post criticized Postgres harshly, and the database community erupted. But the interesting part was not the Postgres versus MySQL debate. It was that Uber's solution involved building an entirely new abstraction layer to handle the data integrity problems that foreign keys solve for free.

The numbers tell the story. According to the Stack Overflow Developer Survey, PostgreSQL is now the most used database in the world, at over fifty-five percent of professional developers. MongoDB sits at around twenty-five percent. More importantly, PostgreSQL is consistently rated the most loved database, with relatively few developers saying they want to stop using it.

The NoSQL movement was not a total loss. It taught the relational world important lessons about horizontal scaling, about flexible schemas for genuinely unstructured data, and about developer experience. MongoDB itself has matured enormously and is a good tool for specific use cases. But the core lesson, the one that matters for you as a vibecoder, is that Codd was right in nineteen seventy and he is still right today. If your data has relationships, use a relational database, and use foreign keys to enforce them.

The N Plus One Trap

There is one more thing you need to know about joins, and it is probably the most common performance mistake in web applications built with AI assistance. It is called the N plus one query problem.

Here is how it works. You want to display a list of ten blog posts with each author's name. A naive approach, and the approach that most ORMs and AI-generated code will produce, does this. First, it runs one query to get the ten posts. Then, for each post, it runs a separate query to get that post's author. That is one query for the list plus ten queries for the authors. Eleven queries to answer a question that a single join could answer in one.

Scale it up. A hundred posts? One hundred and one queries. A thousand posts? One thousand and one queries. Each query has overhead, the round trip to the database, parsing the SQL, building the plan, returning the result. A single join query that says "give me all posts and their authors at once" does the same work in one trip, and Postgres is spectacularly good at it. The query planner can choose from three different join strategies, nested loop, hash join, and merge join, picking the most efficient one for your specific data distribution.

This is where the "joins are slow" myth comes from, incidentally. Joins are not slow. Badly written joins with missing indexes are slow, the same way driving is not slow but driving with the parking brake on is. A properly indexed join between two million-row tables executes in milliseconds. The nested loop join, which is the most common type in normal web applications, is extremely efficient when the inner table has an index on the join column.

[excited] We were making four hundred database calls per page load. Replaced them with six joins. Page load went from two seconds to eighty milliseconds. Joins are not the problem. Not joining is the problem.

Vibecoders have a particular vulnerability here because AI assistants generate code that interacts with databases through ORMs, object-relational mappers, which abstract away the SQL. The ORM makes it easy to traverse relationships in code. You write post dot author dot name, and it feels like accessing a property on an object. But behind the scenes, every dot might be a new query. The AI does not see this. It generates code that reads beautifully and performs terribly.

The fix is to tell the ORM to eager load the related data, which translates to using a join under the hood. In Django, you use select related. In Rails, it is includes. In SQLAlchemy, it is joinedload. This is one of those cases where you as the human need to understand what is happening at the database level, even if you never write SQL by hand, because the AI will not flag the performance problem for you.

A Real Database, Properly Split

Let me show you what good normalization looks like in the real world. On our Postgres server, there is a database called parkit that powers a productivity suite. It has three services, Capture for collecting ideas, Focus for executing on them, and Time for calendar management. They all share one Postgres database with properly normalized tables.

The projects table stores projects. Just the project name, its color, its status. Each project gets a unique ID. The ideas table stores captured items, tasks, notes, voice memos. Each idea can optionally reference a project through a foreign key. An idea can also reference a milestone, which itself references a project through another foreign key.

Now here is where normalization pays off. The focus sessions table tracks what you do with each idea. Did you complete it, snooze it, skip it? Each session references an idea through a foreign key. The comments table lets you annotate ideas, again linked by foreign key. The mode assignments table routes ideas to different workflow modes, again through foreign keys.

None of these tables duplicate data. The idea's text lives in one place. The project's name lives in one place. If you rename a project, it is one update to one row, and every idea, milestone, session, and comment that references that project automatically reflects the new name. There are no fixup jobs. There are no stale copies. The foreign keys guarantee that every reference points to something real.

And when you need to see everything together, a single query with a few joins assembles the complete picture. Show me all ideas for the "tooling" project that were completed this week, with their comments and session history. Postgres handles the join across four tables in a few milliseconds, because each foreign key column is indexed, and the query planner knows exactly how to combine them.

Compare this to what would happen with a denormalized design. Every idea would carry a copy of the project name, the mode label, the milestone title. Rename a project? You need to update every idea that references it. Hundreds or thousands of rows. Miss one, and your data is inconsistent. Delete a project? Now you have ideas floating in space, referencing a project name that exists nowhere else. The foreign key constraint makes that impossible. Postgres simply will not let you delete a project that still has ideas attached to it, not without explicitly telling it how to handle the orphans.

Your Data Is Organized, But Is It Fast?

Here is where we leave things for today. You now understand why relational databases split data across tables. You understand foreign keys and why they matter. You can recognize the most common join types and know which one to reach for. You understand normalization at a practical level, enough to spot when an AI-generated schema has smeared data across tables where it does not belong. And you know about the N plus one trap, the performance landmine that hides inside every ORM.

[sad] Ted Codd died on April eighteenth, two thousand three, at his home in Florida. He was seventy-nine. He had watched his idea go from an eleven-page paper that his own employer tried to suppress, to the foundation of a multi-billion dollar industry. Ray Boyce, who helped formalize the normalization rules and co-invented the query language you still use today, never got to see any of it. He was twenty-seven. <break time="1s"/>

[serious] The relational model survived the hierarchical databases of the seventies, the object databases of the nineties, and the NoSQL rebellion of the twenty-tens. It survived because Codd's core insight was correct. Data has relationships. A database should enforce them.

But organized data is not the same as fast data. In the next episode, we are going to talk about the thing that makes Postgres queries go from glacial to instant. Indexes. The chapter markers in the book of your data. And we will see why AI assistants add them everywhere, or nowhere, but almost never in the right place.