top of page

The Engines and Clusters Behind Python Data Work

  • Writer: Mic
    Mic
  • Feb 26
  • 7 min read

In the last post we compared pandas, polars, and dask. That was the visible part of the iceberg. This post is about everything below the waterline.

Because once you move beyond “does this run?” and start asking:

  • Why is this slow?

  • Why am I out of RAM?

  • Why is my CPU at 12%?

  • Why does this suddenly need DevOps approval?

You are no longer choosing a DataFrame library. You are looking at structural decisions. And the architecture that you need for this structure has layers. In this post, we want to look at some of these layers. We will stick to our basic dataset from the previous post: products with sales amount, regions, etc.

I get the feeling that ChatGPT thinks that Modin is the coolest one when asked to summarize all in an image....

The Memory Layer — Apache Arrow

Before we talk about speed or clusters, we have to talk about memory layout. Most people never think about memory layout. Most people should.

Apache Arrow is not “another DataFrame library.” It is a columnar memory format designed specifically for analytical workloads.

Traditional row-oriented layout looks like this (conceptually):

Columnar layout looks like this:

Why does this matter? Because analytical queries usually operate column-wise:

  • filter by region

  • sum amount

  • group by product

Columnar layout allows:

  • SIMD/vectorized execution

  • better CPU cache usage

  • zero-copy data sharing between systems

On the other hand it is inefficient for frequent data requests and transactions.

Let’s touch it directly:

No pandas. No DataFrame wrapper. Just a memory layout and compute kernels.

Now here’s the interesting part:

  • Polars uses Arrow memory internally.

  • Pandas 2.x can use Arrow-backed columns.

  • Spark can exchange Arrow data.

  • DuckDB speaks Arrow fluently.

Arrow is an infrastructure layer, the plumbing so to speak. You rarely praise plumbing. But you will curse when you don't have it.

Advantages

Arrow’s biggest advantage is architectural: it decouples memory representation from any single DataFrame library. That means Polars, Spark, Pandas (2.x), DuckDB and others can exchange data efficiently without expensive conversions. If you care about interoperability and performance at the memory level, Arrow is foundational. It’s also extremely fast for columnar analytics and enables true zero-copy sharing across processes and languages.

Disadvantages

Arrow is not a full analytics environment. You don’t get the rich API ergonomics of pandas or the execution planning of Spark. Working directly with Arrow can feel lower-level and slightly mechanical. It’s infrastructure, not a cozy abstraction. If you’re just cleaning CSVs, you probably don’t need to think about Arrow explicitly — even though it may already be helping you behind the scenes.


Big Data, Small Laptop — Vaex

Now let’s say your dataset is 40GB. Your laptop has 16GB RAM. Your confidence is high. Your swap file is sweating. This is where Vaex enters the room. It is built around a simple idea:

What if we just… don’t load everything into memory?

Instead, it uses memory-mapped files and lazy evaluation.

By itself this looks very familiar to the syntax of pandas itself ... just a bit leaner.

But here’s what’s happening behind the scenes:

  • The file is memory-mapped.

  • Operations build an expression graph.

  • Execution happens lazily.

  • Only necessary data is touched.

Vaex lives in a fascinating niche:

  • Bigger than RAM of your local system

  • Smaller than “rent a cluster” and get a second mortgage

  • Fast enough for exploration

If pandas says:

“Load everything.”

Vaex says:

“Let’s think about that.”

It's a very quick and easy option to work with large files.

Advantages

Vaex shines when datasets are larger than memory but you don’t want cluster complexity. It enables out-of-core analytics on a single machine and feels relatively familiar if you’ve used pandas. For exploratory work on very large files, it can be surprisingly efficient and practical. It also integrates well with Arrow and other columnar formats.

Disadvantages

Vaex is more specialized and less ubiquitous than pandas or Spark. The ecosystem around it is smaller. Some advanced pandas-style manipulations are not as seamless. If your workload requires heavy transformations, complex joins, or distributed scaling across multiple machines, Vaex starts to show its boundaries. It’s excellent at what it does — but what it does is intentionally constrained.


What If pandas Actually Used Your CPU? — Modin

Everyone knows pandas. A major limitation is that it politely uses one CPU core while your other 15 cores are on vacation. Modin change exactly that. And the API change is almost comically small:

That’s it. Behind the scenes:

  • The DataFrame is partitioned.

  • Tasks are scheduled.

  • Execution runs on a distributed runtime.

That runtime can be:

  • Ray

  • Dask

How is this decided? There is nothing in the code that tells us what Modin is using in our example. This is intentional, the choice of the engine is done via environment variables. Hence we add the following in front of the Modin import

This will tell Modin that we want to use ray for our computing engine. Modin is not rewriting pandas semantics. It is distributing them.

Architecturally, this is important: Modin is not an engine. It is a scaling layer. It delegates execution to something deeper.

Advantages

Modin’s main strength is minimal migration friction. You keep pandas semantics and often see speedups on multi-core machines with little code change. It leverages powerful distributed backends while preserving a familiar API. For teams deeply invested in pandas but needing more parallelism, Modin can be an attractive stepping stone.

Disadvantages

Modin is a wrapper layer. It inherits complexity from its backend. Debugging performance issues can become less transparent, because you’re now dealing with partitioning, scheduling, and distributed execution under the hood. Not all pandas operations are perfectly optimized or supported at scale. In practice, real-world gains depend heavily on workload characteristics. It’s powerful — but not magical.


When Your Dataset Needs a DevOps Team — Apache Spark

Apache Spark is not “bigger pandas.” It is an entire distributed execution ecosystem written on the JVM.

When you write:

You are not just filtering data. You are constructing:

  • A logical plan

  • An optimized plan

  • A physical execution plan

Then Spark:

  • Splits work across executors

  • Manages shuffles

  • Handles fault tolerance

  • Coordinates cluster nodes

Spark is what happens when your laptop stops being “the computer” and starts being just a client. It’s not something you casually reach for to crunch a CSV. With Spark, you provision clusters, tune executors, and find yourself discussing shuffle partitions in meetings. It lives in a different world — one built for distributed, industrial-scale data processing rather than quick, local analysis.

Advantages

Spark excels at large-scale distributed processing. It provides fault tolerance, optimized execution planning, and mature integrations with storage systems and enterprise infrastructure. For truly large datasets — terabytes and beyond — Spark offers reliability and scalability that single-machine tools cannot match. It’s battle-tested and deeply integrated into the big data ecosystem.

Disadvantages

Spark comes with complexity. JVM infrastructure, cluster configuration, shuffle tuning, executor management — this is not lightweight tooling. For small to medium workloads, Spark can feel heavy and over-engineered. There is operational overhead, and performance tuning often requires expertise. It’s extremely powerful — but you pay for that power in setup and maintenance complexity.


The Execution Substrate — Ray

Finally, we reach Ray. Ray is not a DataFrame system, rather it is a distributed execution framework. Think:

  • Task scheduling

  • Actor model

  • Object store

  • Cluster coordination

A tiny example:

This isn’t really about tabular data at all. It’s about distributed computation in the broader sense. Modin can run on Ray. Machine learning workloads run on Ray. Entire reinforcement learning systems run on Ray.

If Spark feels like a dedicated data engine, Ray feels more like a distributed brain. It doesn’t particularly care what kind of workload you give it — data processing, training loops, simulations, serving. What it cares about is scheduling it efficiently and scaling it across resources.


The Layered View

If we zoom out, modern Python data architecture is layered:

Cluster Layer      →   Spark / Ray
Parallel Layer     →   Dask / Modin
Engine Layer       →   Polars / Vaex
Memory Layer       →   Arrow
Classic Layer      →   pandas

Each of these layers exists to solve a different problem.

Apache Arrow tackles memory layout. Polars and Vaex focus on high-performance analytics on a single machine. Modin and Dask introduce parallelism, spreading work across cores. Spark and Ray move even further out, orchestrating distributed execution across multiple machines.

They aren’t competitors in the usual sense. They’re architectural responses to different scaling questions — each one stepping in when the previous layer reaches its limits.

Advantages

Ray is flexible and general-purpose. It supports distributed tasks, actors, and complex ML workflows. It’s particularly strong in AI and machine learning pipelines where you need dynamic task graphs and scalable execution. Its design allows for fine-grained parallelism and integration with modern ML tooling.

Disadvantages

Ray is infrastructure, not a high-level analytics abstraction. You must design your workload explicitly in terms of tasks or actors. It doesn’t automatically give you SQL-style optimization like Spark. For straightforward tabular analytics, Ray can feel lower-level and more hands-on. It’s powerful — but it expects architectural thinking.


So What Should You Actually Use?

Probably pandas.

Let’s be honest — most workloads simply don’t need a cluster.

But the moment you start running into memory limits, notice your CPU sitting idle, require multi-machine scaling, or step into distributed ML workflows, you’re no longer just choosing a DataFrame API. You’re choosing a runtime model. And that’s when the choices start to matter.

Your DataFrame stops being “just a table.” It becomes a thin abstraction over a memory format, an execution engine, a scheduler — and possibly an entire cluster humming somewhere behind the scenes.

Once you see that, the question changes. You stop asking, “Which library is fastest?” and start asking, “Which architectural layer do I actually need?”

That’s a far more interesting question ... and far more important. Your library can be as fast as it claims to be. If there is a serious system bottleneck the speed means little.



Comments


bottom of page