top of page

DuckDB: Turning “This Will Crash” Into “Oh, That Was Fast”

  • Writer: Mic
    Mic
  • Jan 5
  • 5 min read

There is a predictable moment in many projects using it where Pandas starts to feel uncomfortable. Nothing is technically broken. The code still works. But performance begins to degrade, memory usage climbs, and workflows that once felt simple become fragile. You load multiple CSV files, perform joins, compute aggregations, and suddenly your laptop fans are competing for attention with your IDE.

This friction is not a failure of Pandas. It is a signal that you are moving from dataframe manipulation toward analytical workloads. That transition is exactly where DuckDB shines.

DuckDB is an embedded analytics database that runs directly inside your Python process. It behaves like a high-performance SQL engine, but without any server setup. Once you understand what it does well, it often changes how you design data pipelines entirely.

I guess the Image generator totally understood the content of this post.

The Problem With Pandas at Scale

Pandas is extremely powerful for in-memory data manipulation, but it was never intended to function as a query engine over multiple large datasets.

When working with larger files, several inefficiencies begin to appear. Reading data repeatedly from disk becomes expensive, joins duplicate memory, and complex transformation pipelines grow harder to reason about. Even something conceptually simple like aggregating customer totals across multiple yearly files can require substantial memory overhead.

Consider a straightforward example:

This code is perfectly reasonable and a very simple question. Take the sales data of two years, let's say it includes customer_id, date of a sale, amount of the sale and sum up per customer_id and sort afterwards.

The problem is not correctness — it is scalability. As file sizes increase, memory pressure grows quickly because Pandas must materialize everything before computation begins.

DuckDB approaches this problem differently. Instead of loading all data first and computing second, it treats files as queryable data sources and processes them efficiently during execution.


What DuckDB Actually Is

DuckDB is often described as “SQLite for analytics,” but that analogy only partially captures its capabilities. Like SQLite, it runs embedded inside your application and requires no server. Unlike SQLite which uses an OLTP structure, it is designed specifically for analytical workloads, using an OLAP structure. This means it is optimized for operations like aggregations, joins, and scans. Or in other words, SQLite looks at data row-wise, very efficient for countless small transactions, while DuckDB looks at data column-wise, very efficient for aggregations, joins and similar operations.

But the language used by DuckDB is very much SQL language

And we have already started querying a database engine.


Querying Files Like Tables

As already seen in the basic example above, one of the most compelling features of DuckDB is its ability to treat files as tables directly. There is no import step required. Instead, DuckDB reads data lazily and only processes what is necessary for the query.

For example:

Here the CSV file is referenced directly inside SQL. DuckDB scans only the required columns and applies the filter during reading. The .df() method converts the result into a Pandas dataframe.

Even more interesting is that DuckDB can read multiple files automatically using patterns:

Without writing loops or concatenation logic, DuckDB processes all matching files as one logical dataset.


Working With Pandas and Polars

DuckDB integrates naturally with existing Python data tools. You can query Pandas dataframes directly without any explicit registration.

DuckDB automatically detects the sales variable and treats it as a table. The same concept works with Polars:

This ability effectively turns DuckDB into a universal query layer across Python data structures. These two standard dataframe libraries are not the only ones that can be used. It can query parquet and feather files, JSON and delimiter files, NumPy arrays, PyArrow tables, SQLite databases to just name a few of the most commonly used cases.

Performance Surprises

DuckDB’s speed often surprises people because it behaves more like a modern analytical database than a dataframe library. Internally it uses columnar processing, vectorized execution, and aggressive query optimization techniques. These design decisions allow it to outperform Pandas significantly for many aggregation and join workloads.

A simple comparison illustrates the difference:

In this example 'large_file.csv' is a randomly generated file (using chatGPT) with over 1 million rows and columns: name, category, and value. On an old laptop the numbers I get are here

Pandas: 0.213
DuckDB: 0.053

A speed increase factor of 4 and this example does not even use memory expensive procedures like a join of two datasets by adding a second 1 million rows file.


A Real Pipeline Example

To understand how DuckDB fits into workflows, it helps to look at a realistic pipeline. Suppose you want to ingest multiple raw CSV files, clean them, compute customer metrics, and export results for downstream analysis.

DuckDB can handle the entire pipeline:

It loads multiple files fitting the pattern 'sales_*.csv', selecting those entries where the amount is not missing. Processes the data by aggregating the amounts, grouping by customer_id and counting number of orders they had. And finally saving all of this to a parquet file. Especially the latter is an important feature of DuckDB, it is very fast in the use of parquet files.

Instead of multiple Pandas steps, intermediate files, and memory duplication, the workflow becomes a compact SQL-driven transformation pipeline. Note especially that we never needed to load the individual files in the first place and we could have combined the two database execution steps in one as well.


When DuckDB Is the Wrong Tool

Despite its strengths, DuckDB is not a universal solution. It is not intended to replace transactional databases or large distributed systems. If your application requires high concurrency, continuous writes, or multi-node scaling at massive scale, other systems remain more appropriate.

DuckDB excels in a specific niche: local analytical processing where performance and simplicity matter more than distributed infrastructure.


Architecture Patterns

In practice, DuckDB often appears as a component rather than the center of a system. It can act as an analytics helper inside Python applications, a local query engine over data lakes, or an embedded Extract-Transform-Load processor that transforms raw data into Parquet files. In notebooks, it frequently serves as a performance accelerator that replaces slow dataframe operations with optimized SQL queries.

One of the most interesting architectural shifts DuckDB enables is removing unnecessary complexity. Many workloads that previously required Spark or heavy orchestration can now run comfortably on a single machine with far simpler code.


Small Tool, Big Impact

DuckDB isn’t just a faster way to run queries — it’s one of those tools that makes you question some of your past life choices. Suddenly, files behave like tables, SQL turns into a perfectly reasonable scripting language, and the phrase “this won’t fit into memory” loses most of its emotional weight. Pipelines that previously required three libraries, two temp folders, and a mild existential crisis can often be replaced by a handful of readable steps that actually make sense when you revisit them a month later.

For Python developers working with data that has grown beyond cute tutorial sizes, DuckDB is easily one of the highest-leverage tools you can learn. It doesn’t demand new infrastructure, doesn’t require you to become a distributed systems engineer overnight, and won’t judge you for the questionable CSVs you’ve accumulated over the years. It just quietly makes things faster, simpler, and far less painful — which, honestly, is what we wanted all along.


Comments


bottom of page