useful-notes/hqew/009-distributed-query-engines.md

2.4 KiB

Distributed Query Engines

A reference for what changes when query execution moves from one machine to many.


Short answer

A distributed query engine is not just a single-node engine with remote workers.

Once execution is spread across machines, the engine needs extra architecture for:

  • partitioning data
  • moving data between stages
  • scheduling tasks
  • handling failures

Those concerns become first-order design problems.


The basic shape

Most distributed engines have:

  • a coordinator or planner
  • workers or executors
  • partitioned input data
  • exchange or shuffle steps
  • a final result collection stage

The plan is usually broken into fragments or stages that can run in parallel.


What distribution adds

Partitioning

Data is split across machines, often by file boundaries, shards, or key ranges.

Exchange / shuffle

Rows or batches are moved across the network so that later operators see the right grouping or join partition.

Scheduling

The system decides where tasks run and when.

Fault tolerance

The engine needs some strategy for retries, recomputation, or checkpointing.

Coordination overhead

Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.


Why distributed execution is hard

The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.

For example:

  • a join may require shuffling huge datasets
  • a group-by may need global repartitioning by key
  • skewed keys may overload one worker

So distributed optimization is partly about minimizing data movement, not just local operator cost.


Common patterns

Map-then-reduce style

Local work happens close to the data, then partial results are shuffled and merged.

Stage DAGs

Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.

Partial then final aggregation

Workers compute local aggregates, then a later stage merges them.


Practical mental model

Single-node engines mostly optimize CPU, memory, and local I/O.

Distributed engines must also optimize:

  • network traffic
  • partition balance
  • task scheduling
  • failure recovery

That is why distributed query processing is a second architecture layer rather than a small extension.


Changelog

  • April 1, 2026 -- First version created.