DuckDB: The Database That Lives in a File

Two Researchers and a Quiet Frustration

In two thousand eighteen, two researchers at the Centrum Wiskunde & Informatica in Amsterdam were getting frustrated. The institute, usually called CWI, is the Dutch national research center for mathematics and computer science. The two researchers, Mark Raasveldt and Hannes Mühleisen, worked on database systems. Specifically, they were working on a problem that everyone in their field knew about but nobody had solved well: how to run serious analytical queries on data that lives on your own laptop.

The state of the art at the time was unsatisfying. If you had a few megabytes of data, you could load it into Excel or into a pandas dataframe in Python and do your analysis there. This worked but did not scale. The moment your data crossed a few hundred megabytes, your tools slowed to a crawl. If you had a few gigabytes, you needed a real database. But the real databases of the era, PostgreSQL or MySQL or Microsoft SQL Server, were designed to run on servers. You installed them, you ran a service, you opened ports, you managed users, you backed them up. They were enterprise software. They were not designed for a researcher who wanted to crunch a thirty-million-row CSV file once and then forget about it.

[calm]

Raasveldt and Mühleisen decided to build the missing piece. They wanted a database that ran inside whatever program you happened to be writing, the way the small embedded database SQLite ran inside whatever program you happened to be writing, but designed specifically for analytical workloads. SQLite is wonderful for storing data your application uses. SQLite is not wonderful for asking complicated questions across millions of rows. The two researchers wanted SQLite, but for analysts.

They called it DuckDB. The name comes from Wilbur, a pet duck that Mühleisen kept on the boat where he lived. The project's own explanation adds that ducks are hardy animals that can fly, walk, and swim, which is a fair description of a database meant to run anywhere. The duck became the logo. There is now a small but devoted following of people who put duck stickers on their laptops because of this. One of the most influential database projects of the last decade may also have one of the most charming origin stories.

What An Analytical Database Actually Does

To understand why DuckDB matters, you need to understand the distinction between two kinds of database workloads. The first kind is called transactional. You have an application that needs to read or write a small number of rows at a time. Look up this user. Update that order. Insert a new payment. The rows are touched one at a time. Speed comes from being able to find a specific row quickly. This is what most databases were built for.

The second kind is called analytical. You have a researcher or a journalist who wants to ask questions that touch many rows at once. Show me the total revenue by month for the last five years. Show me every company that filed a permit between two thousand twenty and two thousand twenty-three. Show me the average salary by job title across all employees. These questions touch entire columns of data, often summing or grouping millions of values to produce a single answer.

The two kinds of work need different database designs. Transactional databases store rows together, because each operation needs to read or write a whole row. Analytical databases store columns together, because each query needs to read or write a whole column. The same data, stored differently, with different performance characteristics for different questions.
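The difference between the two layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any real database stores data, and the orders table and its values are made up.

```python
from array import array

# Hypothetical sample data: three orders with an id, a customer and an amount.
rows = [
    (1, "alice", 120.0),
    (2, "bob", 80.0),
    (3, "alice", 50.0),
]


def lookup_order(order_id):
    """Row layout: each record is kept together, which suits 'fetch order 2'."""
    for row in rows:
        if row[0] == order_id:
            return row
    return None


# Column layout: each field is kept together, which suits 'sum all amounts'.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "customer": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}


def total_amount():
    """Only the 'amount' column is touched; ids and customers are never read."""
    return sum(columns["amount"])


print(lookup_order(2))
print(total_amount())
```

The lookup wants a whole record and is served well by the row layout. The total wants one field from every record and is served well by the column layout.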

DuckDB is a columnar analytical database packaged as a library. You include it in your Python script or your R session or your Node application, the same way you would include any other library. There is no server to install. There is no port to open. There is no user to manage. The database lives inside your program, and when your program exits, the database is just a file on disk that you can pick up later or copy to another machine.

This is a radically simpler model than what came before. The setup time is zero. The maintenance is zero. The cost is zero. For a researcher who wants to do one piece of analysis and then move on, DuckDB removes the friction that used to make this kind of work expensive.

The Performance Story

The other thing that makes DuckDB unusual is how fast it is. The benchmarks are genuinely surprising. On many analytical workloads that fit on a single machine, DuckDB matches or beats commercial systems that cost far more to run. It runs on a laptop. It spreads work across however many CPU cores the machine has. It handles datasets that go well beyond the memory size of the machine by spilling intermediate results to disk.

The speed comes from a few specific design choices. The columnar storage means that when you ask for the sum of one column across ten million rows, the database reads only that one column. It does not have to read every other field of every row along the way. On a wide table, this can be ten or twenty times faster than a row-oriented database running the same query.

The vectorized execution means that the database processes data in batches of about two thousand values at a time rather than one value at a time. Each step of a query runs as a tight loop over a batch, which keeps the data in the processor's cache and lets the compiler use the vector instructions that modern processors provide. A modern CPU can perform several additions, sometimes a dozen or more, with a single instruction. DuckDB takes advantage of this in a way that older databases were never designed to do.
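The batch-at-a-time idea can be sketched in plain Python. This shows only the control flow, where each operator is called once per batch instead of once per value. It does not reproduce the hardware-level speedup, and the batch size and function names are illustrative assumptions rather than DuckDB internals.

```python
from itertools import islice

BATCH_SIZE = 2048  # illustrative batch size; a real engine picks its own


def batches(values, size=BATCH_SIZE):
    """Yield successive lists of up to `size` values."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def filter_positive(batch):
    """One operator call handles a whole batch, not a single value."""
    return [v for v in batch if v > 0]


def sum_positive(values):
    """Run a tiny filter-then-sum pipeline one batch at a time."""
    total = 0
    for batch in batches(values):
        total += sum(filter_positive(batch))
    return total


print(sum_positive(range(-5000, 5001)))
```

The per-call overhead of each operator is paid once per batch, which is where much of the benefit comes from even before vector instructions enter the picture.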

The query optimizer is unusually clever. The optimizer is the part of the database that decides how to actually run your query. Should it use this index or that one? Should it join these tables in this order or that order? Should it cache this intermediate result? DuckDB's optimizer is the work of years of academic research, and it produces query plans that are often better than what a human expert would come up with.

[serious]

The combined effect is that you can ask serious questions of serious amounts of data on a normal laptop, and get answers in seconds. This is genuinely new. Five years earlier, the same workloads would have required either a cluster of servers or a multi-hour wait. The democratization of analytical computing is one of the most important things that has happened in journalism technology in the last few years.

The Format Story

There is another piece of DuckDB worth knowing about, which is its relationship to data formats. DuckDB can read and write a number of different formats directly, without any conversion step. You can point it at a comma-separated values file. You can point it at a Parquet file, which is a popular columnar storage format. You can point it at a JSON file. You can point it at a database in SQLite format. You can point it at an Excel spreadsheet. You can point it at a Postgres database over the network. You can point it at a gzip-compressed CSV file on a remote web server. Some of these sources need an extension, which DuckDB can install and load on request. In each case DuckDB reads the data where it sits, without a separate import step.

This is unusual. Most databases require that you first import data into their own internal format before you can query it. DuckDB does not. You can run a query that joins a CSV file from one source against a Parquet file from another source against a SQLite database from a third source, all in one query, all without any explicit import step. The database figures out how to read each source and processes the data in place.

This matters for the kind of work where the data is constantly changing. A journalist tracking a public dataset that updates daily can simply query the latest version of the file. There is no import. There is no schema migration. There is just a query that reads the current state of the world. If the file changes tomorrow, the query reads the new state tomorrow.

The Spatial Extension

One specific feature of DuckDB that matters for anyone doing geographic work is the spatial extension. The extension adds support for geographic data types and operations, modeled after the PostGIS extension to PostgreSQL, which has been the gold standard for spatial databases for two decades.

With the spatial extension, you can store points and lines and polygons in your DuckDB database. You can ask spatial questions. Show me every exploration permit whose polygon intersects this national park. Show me every village within five kilometers of an active mine. Show me the total area of land covered by mining concessions in this county. These questions, which used to require a dedicated spatial database, can now be asked from inside a Python script with no setup.
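To show what a question like "every village within five kilometers of a mine" reduces to, here is a sketch in plain Python using the haversine great-circle formula. It is not the spatial extension's API, and the place names and coordinates are made up. With the extension loaded, the same filter would be written in SQL using PostGIS-style functions such as ST_Distance or ST_Intersects.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def villages_near(mine, villages, radius_km=5.0):
    """Return the names of villages within radius_km of the mine, sorted."""
    return sorted(
        name
        for name, (lat, lon) in villages.items()
        if haversine_km(mine[0], mine[1], lat, lon) <= radius_km
    )


# Hypothetical coordinates as (latitude, longitude).
mine = (52.0, 5.0)
villages = {
    "Northwick": (52.03, 5.0),  # a few kilometers north of the mine
    "Eastham": (52.0, 5.2),     # well over ten kilometers east
}

print(villages_near(mine, villages))
```

A spatial database does the same comparison, but with indexes that avoid checking every village against every mine.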

For journalism that involves mapping, this is a meaningful capability. A small newsroom can do spatial analysis that previously required either a paid commercial GIS or a complex self-hosted database stack. The spatial extension is open source. It is fast. It reads the standard geographic formats including GeoJSON and Shapefile and GeoPackage. It plays well with QGIS: DuckDB can export to formats such as GeoPackage that QGIS opens directly, and a QGIS plugin can read a DuckDB database as a data source.

The combination of analytical performance plus spatial capabilities plus the ability to read many file formats plus the ability to run anywhere is what makes DuckDB suitable as the engine for an investigative data pipeline. You can pull data from government sources daily. You can store it in Parquet files in a folder. You can run analyses across years of accumulated data in seconds. You can join geographic data against corporate data against time series data, all in one query. The friction is gone.

What Makes It Sustainable

The thing worth thinking about with DuckDB is how the project has stayed sustainable. The two original researchers founded a company called DuckDB Labs, based in the Netherlands. The company does not sell the database, because the database is free and open source under the MIT license. The company sells services. Consulting, training, custom development, premium support. The database itself remains free and accessible to anyone.

This is a model that has become increasingly common for open source database projects. The company that maintains the project makes its money from services around the project rather than from licensing the project itself. The advantage is that the project stays open and the community trusts the project. The disadvantage is that the company has to be very careful about what services it sells, because anything that would compete with what users could do themselves with the open project would erode trust.

DuckDB Labs has navigated this well. They have not added closed features to the database. They have not held back important capabilities for paying customers. The open project gets better and better, and the company sells expertise to companies that want help using it. This is the model working as intended, and it is one reason DuckDB has won so many users among people who do analytical work in such a short time.

What This Means For A Working Reporter

The practical use of DuckDB for journalism is that it sits in the middle of the data pipeline as the analytical engine. The git scrapers collect data. The Parquet files accumulate over time. DuckDB queries them when you have a question. The output of the query becomes the input to QGIS for mapping, or to a chart library for visualization, or to a text article that quotes the specific numbers.

The same engine works for investigative database work, where you have a slowly growing collection of facts about your beat, and you want to ask cross-cutting questions occasionally. The same engine works for one-off analyses where you have a leaked spreadsheet that you want to understand. The same engine works for spatial analyses where you have geographic data alongside non-geographic data and you want to combine them.

[calm]

The friction reduction is the real story. Five years ago, doing any of this required either a commercial product or a complex self-hosted setup. Today, it requires installing one library and writing a few lines of code. The cost has collapsed in a way that genuinely changes what is possible for a one-person newsroom. The same tools that data teams at major papers use are now available, free, on any laptop. The skill is what matters. The infrastructure is no longer a barrier.

For a working reporter, the relevant move is to learn the SQL query language. DuckDB speaks SQL, which has been the standard query language for databases for fifty years and is not going away. Investing in SQL is investing in a skill that will be useful for the rest of your career, regardless of which specific database you happen to be using. Twenty years from now, the database in vogue will probably be something else. The SQL will still work. The investment compounds. The duck on the laptop sticker is a reminder of where the friction used to be.
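To make the portability point concrete, here is one of the analytical questions from earlier, average salary by job title, run through SQLite, which ships with Python's standard library. The table and its rows are made up. The query text uses only standard SQL, so it should run unchanged in DuckDB and in most other SQL databases.

```python
import sqlite3

# Plain, standard SQL: nothing here is specific to one database engine.
QUERY = """
SELECT job_title, AVG(salary) AS avg_salary
FROM employees
GROUP BY job_title
ORDER BY job_title
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, job_title TEXT, salary REAL)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "editor", 60000),    # hypothetical staff
        ("Ben", "editor", 70000),
        ("Cy", "reporter", 50000),
    ],
)
avg_by_title = con.execute(QUERY).fetchall()
con.close()

print(avg_by_title)
```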