
Three DataFrame Libraries — Three Mental Models

  • Writer: Mic
  • Feb 23
  • 5 min read

There are many comparisons of pandas, Polars, and Dask. Most of them focus on speed.

  • Which one is faster?

  • Which one scales better?

  • Which one replaces the other?

That framing misses the real difference.

These libraries are not simply competitors. They embody three fundamentally different ways of thinking about data processing. If you understand those mental models, you understand when each tool makes sense — and when it does not.

When you tell an LLM to transform a DataFrame library discussion into something cute...

This post is not about benchmarks. It is about how these libraries want you to think.


The Same Task, Three Ways

We will use a simple scenario throughout:

  • Load a CSV file

  • Filter rows

  • Group by a column

  • Aggregate

  • Save the result

Assume a file sales.csv:

date,region,product,amount
2025-01-01,EU,A,100
2025-01-02,US,A,200
2025-01-03,EU,B,150
...

We want EU sales, grouped by product, aggregated by summing the amount. The code for each library looks deceptively similar. The meaning is not.


Pandas — The Spreadsheet Mental Model

With pandas, the implementation is straightforward and probably familiar to nearly everyone who has ever worked with data in Python.
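A minimal sketch of what this might look like. The output filename eu_sales_by_product.csv is an assumption, and the first statement only recreates the three sample rows from above so that the snippet runs on its own:

```python
from pathlib import Path

import pandas as pd

# Recreate the three sample rows so the sketch is self-contained.
Path("sales.csv").write_text(
    "date,region,product,amount\n"
    "2025-01-01,EU,A,100\n"
    "2025-01-02,US,A,200\n"
    "2025-01-03,EU,B,150\n"
)

df = pd.read_csv("sales.csv")                    # the whole file is loaded here
eu = df[df["region"] == "EU"]                    # a filtered DataFrame exists now
totals = eu.groupby("product")["amount"].sum()   # and so does the aggregate
totals.to_csv("eu_sales_by_product.csv")         # hypothetical output filename

print(totals)
```

Every intermediate name (df, eu, totals) refers to a real object in memory that can be printed or inspected between steps.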

There is no hidden phase here. Each line executes immediately. When you filter, it filters. When you group, it groups. When you sum, it sums.

Pandas behaves like an interactive spreadsheet engine with a Python API. You perform an operation and the result materializes at once. Intermediate states exist in memory. You can inspect them, print them, modify them.

The mental model is procedural and local:

Take this table. Modify it. Now modify it again.

This model is extremely powerful in exploratory work. It matches how analysts think. It rewards experimentation. It provides immediate feedback.

But it carries an implicit assumption: the full dataset lives comfortably in memory.

As long as that assumption holds, pandas is wonderfully simple. When it breaks, everything starts to feel heavy. Chained operations create multiple intermediate objects. Memory usage becomes unpredictable. Performance degrades in ways that are not obvious.

Pandas does exactly what you ask — immediately. It does not try to be clever.

That simplicity is both its strength and its limitation.


Polars — The Query Engine Mental Model

Now consider Polars. Polars can run eagerly, much like pandas, but its strength lies in lazy mode.

At first glance, this looks similar. But something subtle has changed.

scan_csv does not load the data immediately. The filter does not execute immediately. The groupby does not execute immediately.

Nothing actually happens until collect().

Polars builds a logical plan of your transformations. Only when you request the result does it optimize and execute that plan.

This is no longer a spreadsheet mental model. It is a query engine mental model.

You are not describing steps in time. You are describing a transformation. Polars can then:

  • Reorder operations

  • Push filters down

  • Eliminate unused columns

  • Fuse operations

  • Stream data

It can do all of this because it sees the whole pipeline before execution. That is a qualitative shift.

Instead of thinking:

First filter, then group, then sum.

You think:

This is the transformation I want.

Polars is conceptually closer to a local analytical database than to a spreadsheet. Its Rust backend and columnar execution are implementation details. The deeper shift is that you allow the engine to decide how to execute your request efficiently.

This becomes particularly powerful in longer pipelines:

Those filters can be combined. The projection can be pushed earlier. Memory usage can be minimized.

You are no longer micromanaging intermediate DataFrames. You are defining intent. The flip side is that, once the optimizer rewrites the plan under the hood, you no longer know the exact order in which the operations are performed.


Dask — The Distributed Systems Mental Model

Dask looks almost identical to pandas:

The syntax is familiar by design. But internally, something quite different happens.

Dask does not operate on one DataFrame. It operates on many partitions. When you call read_csv, it splits the file into chunks. When you apply transformations, it builds a task graph. When you finally compute, it schedules those tasks across threads, processes, or even multiple machines. We will not go into the details of how the partitioning happens; it can be controlled through several read_csv parameters, depending on your needs.

Note the to_csv call at the end. Instead of a single output file, it writes one file per partition. The result is still a partitioned Dask collection at this point; pulling it back into a single in-memory dataset is a separate step (calling compute()) that we skipped here. If you want to go straight back to a single dataset, Dask might not be the right fit for your application.

This is no longer a spreadsheet model. It is not even a query engine model. It is a distributed systems model.

You are now working with:

  • Partitioned data

  • Partial results

  • Task scheduling

  • Coordination between workers

Operations like groupby become multi-stage processes: local aggregations on partitions followed by a global aggregation. Failures are no longer simple exceptions — they may involve worker crashes, memory pressure on one node, or scheduler bottlenecks.
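To make the two stages concrete, here is a plain-pandas sketch of the idea. It only imitates what a partitioned group-by does conceptually and is not how Dask is implemented; the sample rows and the two hand-made partitions are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["EU", "US", "EU", "EU"],
        "product": ["A", "A", "B", "A"],
        "amount": [100, 200, 150, 50],
    }
)

# Pretend the data lives in two partitions, as it would on two workers.
partitions = [df.iloc[:2], df.iloc[2:]]

# Stage 1: each partition filters and aggregates locally.
partials = [
    part[part["region"] == "EU"].groupby("product")["amount"].sum()
    for part in partitions
]

# Stage 2: the partial sums are combined into one global result.
totals = pd.concat(partials).groupby(level=0).sum()

print(totals)
```

Product A appears in both partitions, so neither partial result is correct on its own. Only the second stage produces the right total.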

Dask introduces the complexity of orchestration. In return, it allows you to scale beyond a single machine.

The mental shift is significant. You are no longer manipulating one dataset. You are coordinating computation across many pieces of it.

Syntactically, Dask stays very close to pandas, and it is tempting to describe it as distributed pandas.


Why the Mental Model Matters More Than Speed

It is tempting to reduce this comparison to performance. But the deeper distinction lies in scaling philosophy.

  • Pandas assumes you can scale up — get more RAM.

  • Polars assumes you can scale smarter — optimize execution.

  • Dask assumes you can scale out — distribute computation.

Each philosophy introduces trade-offs. Pandas gives simplicity and immediacy. Polars gives optimization and efficiency without distributed complexity. Dask gives horizontal scalability at the cost of coordination overhead and conceptual complexity.

Understanding these trade-offs is far more valuable than memorizing syntax differences.


The Subtle Progression

There is a natural learning path hidden here. Pandas teaches you how to manipulate data. Polars teaches you how execution engines think. Dask teaches you how systems scale.

Once you move beyond small, notebook-based datasets, the question is no longer “Which library is faster?” but:

  • Do I need execution planning?

  • Do I need memory optimization?

  • Do I need distributed scheduling?

Those are architectural questions. And architecture is where many industry systems either remain maintainable — or collapse under their own assumptions.


Choosing the Tool

If your dataset fits comfortably in memory and you are exploring interactively, pandas is often the right answer. Of course, no one is keeping you from using Polars here and reminding everyone of its extremely fast Rust backend.

If your dataset is large but still manageable on a single machine and you care about efficiency and optimization, Polars becomes very attractive.

If your data exceeds one machine or you need parallel processing across partitions, Dask enters the picture.

But more important than the choice itself is understanding why you are choosing. When people say “Polars is faster than pandas,” they are often observing the benefits of a query planner and a columnar execution engine.

When people say “Dask is just big pandas,” they underestimate the complexity of distributed task scheduling.

These tools solve different problems because they assume different worlds.


From Syntax to Systems

The real difference between pandas, Polars, and Dask is not syntax. It is not speed. It is not hype. It is the mental model they impose on you. Pandas asks you to think locally and procedurally. Polars asks you to describe transformations and trust an optimizer. Dask asks you to think in partitions, tasks, and coordination.

As your data grows, your mental model has to grow with it. Understanding that progression is what turns a library choice into an architectural decision. And that is where things start to get interesting.


Bonus

If you want to experiment a bit with the dataset, a small Python script can randomly generate an appropriate one.
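The original script is not reproduced here, so the following is a stand-in sketch that uses only the standard library. The function name generate_sales, the extra region APAC, the extra product C, the amount range, and the row count are all assumptions; only the column layout follows the sales.csv example above:

```python
import csv
import random
from datetime import date, timedelta


def generate_sales(path="sales.csv", n_rows=1000, seed=42):
    """Write n_rows of random sales records to path and return the path."""
    rng = random.Random(seed)          # seeded, so repeated runs give the same file
    regions = ["EU", "US", "APAC"]     # assumed set of regions
    products = ["A", "B", "C"]         # assumed set of products
    start = date(2025, 1, 1)

    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "region", "product", "amount"])
        for _ in range(n_rows):
            day = start + timedelta(days=rng.randrange(365))
            writer.writerow(
                [
                    day.isoformat(),
                    rng.choice(regions),
                    rng.choice(products),
                    rng.randrange(10, 1000),   # integer amounts from 10 to 999
                ]
            )
    return path


output_path = generate_sales()
print(f"Wrote {output_path}")
```

Increase n_rows to produce files large enough for the differences between the three libraries to become visible.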

