Reading the Signs: Your Database Is Trying to Tell You Something

The Query That Should Be Fast

This is episode nine of The Vibecoder's Guide to Postgres.

[worried] Something is wrong. You have a query. It is a simple query, a join between two tables with a WHERE clause and a LIMIT. The kind of thing you have written a hundred times, the kind of thing your AI assistant generated in two seconds flat. It should take milliseconds. It is taking eleven seconds. Your users are staring at a loading spinner. Your monitoring dashboard is turning orange, then red. And you have no idea why.

You do what any vibecoder would do. You paste the query into your AI chat and say, "why is this slow?" The AI gives you a perfectly reasonable answer. It suggests adding an index. You add the index. The query still takes eleven seconds. The AI suggests rewriting the join. You rewrite the join. Still eleven seconds. The AI suggests increasing work memory. You do that. Nothing changes.

Here is the problem. Your AI assistant is guessing. It cannot see your tables. It does not know how many rows they have, how the data is distributed, whether your indexes are actually being used, or what the disk is doing. It is giving you textbook answers to a question that requires a physical examination.

What you need is an X-ray. You need to see inside the database, to watch it work, to understand the specific sequence of steps Postgres is taking to answer your query and why those steps are taking so long. [excited] Postgres has this X-ray machine. It is called EXPLAIN ANALYZE. And learning to read its output is the single most practical skill a vibecoder can develop, because it transforms database performance from a mystery into a science.

But before we learn to read the X-ray, we should understand the organ it is imaging. The query planner. And to understand the query planner, we need to go back to a research lab in San Jose, California, in nineteen seventy-nine, where a mathematician named Patricia Selinger was about to prove that computers could think about data better than humans.

Rabbit Hole: The Woman Who Taught Databases to Think

This next section goes deep into the history of query optimization. If you just want to learn how to read EXPLAIN output, skip ahead to the next chapter. But if you want to understand why your database makes the decisions it does, this is where it starts.

In nineteen seventy, Edgar Codd published his paper on the relational model. We covered that in episode two. But Codd's paper had a problem that nobody talked about. It described what a relational database should do, the logical operations, the set theory, the elegant algebra. What it did not describe was how to do it fast.

If you tell a database "give me all orders from customers in Sweden that were placed in the last thirty days," there are dozens of ways to actually fetch that data. You could scan every row in the orders table and check each one. You could look up Swedish customers first and then find their orders. You could start with recent orders and filter by country. You could use an index on the date column, or an index on the customer column, or both. Each approach gives the same answer, but some take milliseconds and others take minutes. The question of which approach to pick is called query optimization, and in the nineteen seventies, the prevailing wisdom was that humans had to make this choice. You, the programmer, had to tell the database exactly how to retrieve the data. The idea that a computer could figure out the best strategy on its own was considered unrealistic.

Patricia Selinger disagreed. She had earned three degrees from Harvard in applied mathematics, finishing her doctorate in nineteen seventy-five, and she joined IBM's San Jose Research Laboratory the same year to work on System R, the first serious attempt to build a working relational database. Her job was to solve the optimization problem. To build a component that could look at a SQL query, consider every possible way to execute it, estimate how expensive each way would be, and pick the cheapest one. Automatically. Without human guidance.

[excited] In nineteen seventy-nine, she and four colleagues published a paper at the SIGMOD conference called "Access Path Selection in a Relational Database Management System." It is one of the most influential papers in the history of computer science. Not because it is long or complex, but because it introduced an idea that changed everything. Cost-based query optimization.

Database management systems can support non-procedural query languages with performance comparable to those supporting the current more procedural languages.

The paper described a system that kept statistics about the data in each table. How many rows, how many distinct values in each column, how many pages on disk. When a query arrived, the optimizer would enumerate the possible execution strategies, use those statistics to estimate the cost of each one, measured in disk reads and CPU work, and choose the strategy with the lowest estimated cost.

The cost model was elegant. It boiled everything down to a single formula. The cost of a query plan equals the input-output cost plus a weight factor times the CPU cost. Disk reads are expensive. CPU work is cheap. Minimize the total, and you minimize the time. The optimizer used dynamic programming to explore join orders, meaning it could find the best way to combine three, five, even fifteen tables without trying every single permutation. For a query involving eight tables, the optimization itself took less than a second on the hardware of the day, an IBM three seventy.

The dynamic programming algorithm for determining join order that Selinger proposed in that paper still forms the basis for most of the query optimizers used in modern relational systems. Every time you run a query in Postgres, in MySQL, in Oracle, in SQL Server, the engine is using a descendant of Patricia Selinger's nineteen seventy-nine optimizer. She later became an IBM Fellow, the company's highest technical honor, won the SIGMOD Innovations Award, and was elected to the National Academy of Engineering. But her most lasting contribution is invisible. It runs billions of times a day, in every database on earth, making decisions about how to fetch your data. And it works so well that most people never think about it at all.

Until it makes a bad decision. And then you need to understand what it is doing and why. That is where EXPLAIN comes in.

What Happens Before Your Query Runs

When you send a SQL query to Postgres, it does not just run it. The query passes through four stages, like a letter going through a mail sorting system, before a single row of data is touched.

First, parsing. Postgres reads your SQL text, breaks it into tokens, checks that the syntax is valid, and produces an internal tree structure representing what you asked for. This is where typos and syntax errors get caught.

Second, rewriting. Postgres applies any rules that transform the query. If you are querying a view, this is where the view's definition gets substituted in. If there are row-level security policies, they get woven into the query here. You asked for a simple table, but after rewriting, you might be querying a complex subquery without knowing it.

Third, planning. This is Selinger's legacy. The planner looks at the rewritten query and generates multiple possible execution plans. For each table, it considers different ways to scan the data. For joins, it considers different strategies and different orderings. For each possible plan, it estimates the cost using statistics about your data. Then it picks the plan with the lowest estimated cost.

Fourth, execution. Postgres takes the chosen plan and runs it. The executor processes the plan tree from the inside out, fetching data from the innermost nodes and passing it up to the outer nodes until the final result emerges at the top.

The third step, planning, is where everything interesting happens. And EXPLAIN lets you see what the planner decided.

Your First Query Plan

Let us start simple. You have a table called orders with a million rows. You want all orders placed in the last seven days. You type EXPLAIN ANALYZE, followed by your SELECT statement, and you get back something that looks like gibberish.

The output says Seq Scan on orders, with a cost in parentheses showing two numbers separated by two dots, something like zero point zero zero to twenty-two thousand, and then an estimated row count, actual time in milliseconds, and actual row count.

Let us unpack this line by line, because every piece of it tells a story.

Seq Scan means sequential scan. Postgres is reading every single row in the table, from the first page on disk to the last. It is the brute force approach, the database equivalent of reading every page of a book to find one paragraph. It always works, but it is rarely the fastest option for selective queries.

The cost shows two numbers. The first is the startup cost, the work Postgres has to do before it can return the first row. For a sequential scan, this is zero, because Postgres can start returning rows immediately. The second number is the total estimated cost to process every row. These numbers are not milliseconds or bytes or anything human. They are arbitrary units, anchored to a single baseline. One unit equals the cost of reading one page from disk sequentially. Everything else is measured relative to that. Reading a page at random costs four units by default, because seeking to a random spot on a spinning hard drive is roughly four times slower than reading the next page in sequence.

The row count shows how many rows Postgres estimates it will return, and with ANALYZE, how many it actually returned. If the estimate says a thousand but the actual says five hundred thousand, you have a problem. The planner made its decisions based on bad information.

The actual time shows how long the node took in milliseconds. This is the real number, the one you care about. Estimated costs are useful for understanding the planner's reasoning, but actual time tells you what really happened.

Now here is the key insight. If you add an index on the order date column and run the same query, the plan changes. Instead of Seq Scan, you see Index Scan using your index name on orders. The cost numbers drop dramatically. The actual time drops from hundreds of milliseconds to single digits. The planner looked at your query, considered both the sequential scan and the index scan, estimated that the index scan would be cheaper, and chose it.

This is the planner doing exactly what Selinger designed it to do in nineteen seventy-nine. Estimating costs, comparing plans, picking the winner. The difference is that now you can see the decision.

The Three Flavors of Scanning

Postgres has three main ways to read data from a table, and understanding when it picks each one is the foundation of performance work.

Sequential scan reads every page and every row. It is the simplest strategy, and for small tables or queries that need most of the rows, it is often the fastest. If your table has a hundred rows, Postgres will almost always choose a sequential scan regardless of what indexes exist, because reading a hundred rows in a straight line is faster than jumping around through an index tree. If your query returns seventy percent of the table, a sequential scan is again the right choice, because the overhead of using an index and then fetching each row individually would be worse than just reading the whole thing.

Index scan uses a B-tree or other index to find the specific rows that match your condition, then fetches those rows from the table's data pages. It is fast for selective queries, the ones that return a small fraction of the total rows. The downside is random I/O. Each matching row might be on a different data page, and jumping from page to page is expensive, especially on spinning disks. This is why the default random page cost is four times the sequential page cost. The planner literally models the fact that random access is slower.

Bitmap scan is the clever middle ground. It works in two phases. First, a Bitmap Index Scan reads the index and builds a bitmap, a map of which data pages contain matching rows. Second, a Bitmap Heap Scan reads those pages in physical order, the same order they sit on disk, which turns random I/O into something closer to sequential I/O. The planner picks a bitmap scan when too many rows match for a plain index scan to be efficient, but not so many that a full sequential scan would be better.

Here is the mental model. Imagine a library. A sequential scan is walking down every aisle, checking every book. An index scan is looking up a specific book in the card catalog, walking directly to its shelf, then going back to the catalog for the next one. A bitmap scan is looking up all twenty books you need in the catalog, writing down their shelf locations, sorting those locations by aisle number, and then walking through the library once, grabbing each book as you pass its shelf. Same result, far less walking.

If you see a sequential scan on a large table in your EXPLAIN output and the query is only returning a handful of rows, that is a red flag. It usually means the planner could not find a suitable index. Either the index does not exist, or something is preventing the planner from using it. We will get to those somethings in a few minutes.

The Numbers Game: Reading Costs and Estimates

Understanding the cost numbers is less about memorizing formulas and more about developing intuition. Here is what matters in practice.

The cost of a sequential scan is roughly the number of pages in the table times one, plus the number of rows times zero point zero one. For a table with ten thousand pages and a million rows, that is about twenty thousand cost units. For an index scan returning a hundred rows from the same table, the cost might be four hundred, because you are reading maybe a hundred random pages at four units each, plus the cost of traversing the index. The planner compares twenty thousand to four hundred and picks the index scan. Simple arithmetic, profound consequences.

But here is where it gets interesting. The planner's estimates depend on statistics. Postgres maintains a system table called pg_statistic that stores information about every column in every table. How many distinct values it has. What the most common values are. How the values are distributed. The planner uses these statistics to estimate how many rows a query will return, which directly affects which plan it chooses.

If those statistics are wrong, the plan is wrong. And statistics go stale. Every time you insert, update, or delete rows, the real data drifts away from the statistics. Postgres has an auto-vacuum daemon that periodically updates statistics, but there are situations where it falls behind. A bulk import of a million rows. A mass deletion. A table that grows rapidly. In those cases, the planner is making decisions based on a map that no longer matches the territory.

[slow] This is the single most common cause of mysterious performance problems. The query was fast yesterday and slow today. Nothing in the code changed. But the data changed, the statistics did not keep up, and the planner started making different decisions. Running ANALYZE on the affected tables forces Postgres to recalculate the statistics, and the problem vanishes.

There was a production incident at a company where CPU usage spiked to eighty percent because a complex query joining six tables had never had its statistics updated. The query went from timing out after ten minutes to finishing in under twenty seconds after a single ANALYZE command. <break time="1s"/> Ninety-six percent faster. The data was fine. The indexes were fine. The query was fine. The statistics were stale.

How Postgres Joins Tables

When your query joins two or more tables, the planner has three strategies to choose from. Each one is good at different things, and seeing which one the planner chose tells you a lot about what it thinks it knows about your data.

Nested loop is the simplest. For each row in the outer table, scan the inner table to find matches. If the inner table has an index on the join column, this is fast, because each inner lookup is an index scan returning one or a few rows. But if the inner table does not have an index, or if the outer table has many rows, nested loop becomes expensive. It is the right choice when one side of the join is small or when there is a good index on the inner side. It is a disaster when the planner underestimates the outer row count, because instead of scanning the inner table ten times as expected, it scans it ten thousand times.

Hash join works differently. Postgres reads one table, usually the smaller one, and builds a hash table in memory keyed on the join column. Then it reads the other table row by row, hashing each join key and probing the hash table for matches. This is efficient for larger tables as long as the hash table fits in memory. If it does not fit, Postgres spills it to disk in batches, which hurts performance. Hash joins only work for equality conditions, where the join uses an equals sign. If your join says "where a dot value is less than b dot value," a hash join cannot help.

Merge join sorts both tables on the join column and then walks through them in parallel, like two fingers sliding down two sorted lists. Every time the values match, it produces a result row. This is elegant and efficient, especially when both sides are already sorted, perhaps because they are being read from an index in order. Like hash joins, merge joins require at least one equality condition. The beauty of a merge join is that each table is scanned only once, but the sorting step can be expensive if the data is not already ordered.

When you see a Hash Join in your EXPLAIN output, the planner is saying "I think the smaller table fits in memory." When you see a Nested Loop with an index scan on the inner side, the planner is saying "I think the outer table has very few rows." When you see a Merge Join, the planner is saying "both sides are big, but I can get them sorted cheaply." Each choice is a window into the planner's model of your data.

The Five Performance Killers

There are five things that reliably make queries slow, and all five are visible in EXPLAIN output if you know what to look for.

The first killer is the missing index. You see a sequential scan on a large table when only a few rows match the filter. The fix is usually obvious: create an index on the column in the WHERE clause. But be careful. Not every sequential scan is a problem. On a small table, a sequential scan is faster than an index scan. The planner knows this. Only worry about sequential scans on tables with thousands of rows or more where the query is filtering to a small subset.

The second killer is the type mismatch. Your column is an integer, but your application is sending the value as text. Or your ORM is casting a parameter to numeric when the column is integer. Postgres cannot use a B-tree index when the types do not match, because it has to cast every value in the index to compare, which defeats the purpose of the index entirely. This is invisible in your application code. You will never see it in the SQL your AI generated. But in the EXPLAIN output, you see a sequential scan where you expected an index scan, and the filter condition shows a type cast that should not be there.

[sigh] I spent three hours debugging a slow query before I realized my ORM was sending every parameter as a string. Three hours. The index was right there. It just was not being used because of an invisible type cast.

The third killer is the function on a column. If your WHERE clause says something like "lower of email equals some value," Postgres cannot use a regular index on the email column. The index stores the original values, but you are searching for a transformed version. The fix is an expression index, an index that stores the result of the function. But your AI assistant will almost never suggest this, because it requires understanding the specific mismatch between the query pattern and the index definition.

The fourth killer is the stale statistics we already discussed. But there is a subtler variant. Sometimes the statistics exist for the table columns, but they do not exist for the expression you are actually querying. An engineering team at Render discovered this the hard way. They had a simple join between an events table with ninety million rows and a database table with a hundred thousand rows. The join was on a value extracted from a JSONB column. Postgres had statistics on the JSONB column itself but not on the extracted value. The planner chose a merge join and estimated it would process ninety million rows. The query took hours. [excited] When they created statistics on the specific expression, using CREATE STATISTICS, the exact same query finished in under one millisecond. The planner switched from a merge join to a nested loop, because it finally understood how few rows actually matched.

The fifth killer is the correlated subquery. This is a query inside a query where the inner query references values from the outer query. Postgres is forced to execute the inner query once for every row in the outer query. If the outer query returns ten rows, you run the subquery ten times. If it returns a hundred thousand rows, you run it a hundred thousand times. This is the database equivalent of the N plus one problem in application code. The fix is usually rewriting the subquery as a join, which lets the planner choose a hash join or merge join instead of being locked into nested loops. Postgres can sometimes optimize EXISTS and NOT EXISTS subqueries automatically, but scalar subqueries in the SELECT clause almost always become nested loops.

pg_stat_statements: The Long Memory

EXPLAIN ANALYZE shows you what is happening right now, with one query, at this moment. But what if you want to know which queries have been slow over the last week? Which ones run most often? Which ones consume the most total time? For that, you need pg_stat_statements.

pg_stat_statements is an extension that tracks every query Postgres executes. Not the individual executions, but the patterns. It normalizes queries by replacing literal values with placeholders, so "select from users where id equals forty-two" and "select from users where id equals ninety-seven" are counted as the same query pattern. For each pattern, it records how many times it ran, the total execution time, the average execution time, how many rows it returned, and how many disk pages it read from cache versus from disk.

I check pg_stat_statements every Monday morning. Sort by total time descending. The top ten queries are where eighty percent of your optimization effort should go. Not the slowest individual query. The ones that consume the most total time across all their executions.

Enabling it requires adding pg_stat_statements to the shared preload libraries setting in your Postgres configuration, which means a restart. Then you create the extension in your database. After that, it silently accumulates data in the background. No application changes. No code instrumentation. It just watches.

There is a companion tool called auto_explain that takes this idea further. Instead of manually running EXPLAIN ANALYZE on individual queries, auto_explain automatically logs the execution plan for any query that exceeds a duration threshold you set. Every query that takes longer than, say, a hundred milliseconds gets its full plan written to the Postgres log. This is invaluable in production, where you cannot sit at a terminal running EXPLAIN on every suspicious query. But there is a catch. When you enable the ANALYZE option in auto_explain, it adds timing overhead to every single query, not just the slow ones. For high-throughput systems, you can set a sample rate to only capture a fraction of plans.

Together, pg_stat_statements and auto_explain give you something that EXPLAIN ANALYZE alone cannot. A historical record. Trends over time. The ability to notice that a query that used to take five milliseconds is now taking fifty, and to catch that drift before it becomes a crisis.

The Vibecoder's Debugging Workflow

Here is where vibe coding and database performance actually work well together.

Step one. Find the slow query. If you have pg_stat_statements running, sort by total execution time or by mean execution time and look at the top offenders. If you do not have it enabled yet, enable it. It is one of the first things you should set up on any Postgres installation. If you are responding to an immediate complaint, check your application logs for the specific query.

Step two. Run EXPLAIN ANALYZE on the query. If the query modifies data, wrap it in BEGIN and ROLLBACK so the changes are not committed. Copy the full output.

Step three, and this is the part that surprises people, paste the EXPLAIN output into your AI assistant. This is genuinely one of the best uses of AI in database work. The AI cannot see your data, your indexes, your table sizes. But when you give it the EXPLAIN output, you are giving it exactly what it needs. Estimated rows versus actual rows, scan types, join strategies, cost breakdowns. The AI can read this and tell you, in plain language, what the planner is doing and what looks wrong. It can spot a sequential scan that should be an index scan. It can notice a ten-thousand-fold mismatch between estimated and actual rows. It can suggest that your statistics are stale or that a type cast is preventing index use.

[laugh] The irony is that the vibecoder who never learned to read EXPLAIN output can actually get pretty far by using the AI as a translator. You give it the raw plan, it gives you the diagnosis. You do not need to memorize what every node type means. You need to know that the information exists and how to extract it.

Step four. Fix the problem. Add the missing index. Run ANALYZE. Rewrite the subquery. Fix the type mismatch. Adjust random_page_cost if you are running on solid state drives, because the default of four assumes spinning disks and penalizes index scans too heavily on modern hardware.

Step five. Run EXPLAIN ANALYZE again and verify that the plan changed. Do not trust the fix. Verify it. Compare the actual times. Make sure the planner is using the strategy you expect.

Step six. If you want to go deeper, paste the EXPLAIN output into a visualization tool. explain.depesz.com was created in two thousand seven by a Postgres community member named Hubert Lubaczewski and has been the standard tool for nearly two decades. It parses the raw EXPLAIN output and shows you per-operation timing, highlights the expensive nodes, and adds helpful formatting like thousand separators for large numbers. explain.dalibo.com offers a visual tree-based view that makes the plan structure easier to grasp at a glance. Both are free. Both accept the same EXPLAIN output you would paste into your AI.

[calm] The point of this workflow is not to make you an expert in query plan internals. It is to close the feedback loop. Without EXPLAIN, database performance is a black box. You change something and hope it gets faster. With EXPLAIN, you can see exactly what changed, exactly what the planner is doing differently, and exactly why the query is now fast. The database was always trying to tell you what was going on. You just needed to know how to ask.

The Eight Databases at Three AM

Let me give you a concrete picture of why this matters. On the VPS that hosts this podcast's infrastructure, there are eight Postgres databases. A live news aggregation system that queries multiple data sources and needs to be fast for real-time display. A music catalog. A productivity suite. An e-commerce platform. A photography database. A newsletter system. An archive with vector embeddings. And a general purpose database for the website itself.

All eight share the same Postgres instance, the same CPU, the same memory, the same disk. When one database's queries start hogging resources, the others slow down. This is not a theoretical concern. It happens. A news aggregation query that scans a table without an index does not just slow itself down. It fills the shared buffer cache with pages that evict cached pages from the other databases. It saturates disk I/O. It consumes CPU cycles that the music catalog needs for its own queries.

pg_stat_statements across all eight databases tells you which queries are consuming the most total resources. EXPLAIN ANALYZE on the worst offenders tells you why. And the fix is usually one of the five killers we covered. A missing index. A stale statistic. A type mismatch nobody noticed. The database was giving signals the whole time. You just have to read them.

The best performance work is not heroic optimization. It is noticing the obvious problems before they compound. Check your slow queries weekly. Run ANALYZE after big data loads. Make sure your indexes match your actual query patterns, not the ones you imagined when you designed the schema.

What You Now Know

You can find and fix slow queries. You know what Postgres does before it touches your data, four stages from parsing to execution, with the planner at the center making cost-based decisions that trace their lineage back to Patricia Selinger's work in nineteen seventy-nine. You know how to read the three scan types, sequential, index, and bitmap, and what each one tells you about the planner's reasoning. You know the five common killers, missing indexes, type mismatches, function-wrapped columns, stale statistics, and correlated subqueries. You know how to use pg_stat_statements for the long view and auto_explain for automatic monitoring. And you know that one of the best uses of your AI assistant is as an EXPLAIN output interpreter, turning raw query plans into plain language diagnoses.

[serious] This is real power. Not the kind of power that comes from memorizing syntax or mastering a framework. The kind that comes from seeing what is actually happening inside the machine.

But knowing how to diagnose a slow query assumes the database is running. It assumes the server is up, the data is intact, the disks are healthy. [worried] What about when those assumptions fail? What about the backup you were supposed to test but never did? What about the replication lag that went unnoticed until the primary died? What about the three AM phone call that every database administrator dreads?

That is next time. Episode ten. The three AM phone call.