108 lines
2.4 KiB
Markdown
108 lines
2.4 KiB
Markdown
|
|
# Distributed Query Engines
|
||
|
|
|
||
|
|
A reference for what changes when query execution moves from one machine to many.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Short answer
|
||
|
|
|
||
|
|
A distributed query engine is not just a single-node engine with remote workers.
|
||
|
|
|
||
|
|
Once execution is spread across machines, the engine needs extra architecture for:
|
||
|
|
|
||
|
|
- partitioning data
|
||
|
|
- moving data between stages
|
||
|
|
- scheduling tasks
|
||
|
|
- handling failures
|
||
|
|
|
||
|
|
Those concerns become first-order design problems.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## The basic shape
|
||
|
|
|
||
|
|
Most distributed engines have:
|
||
|
|
|
||
|
|
- a coordinator or planner
|
||
|
|
- workers or executors
|
||
|
|
- partitioned input data
|
||
|
|
- exchange or shuffle steps
|
||
|
|
- a final result collection stage
|
||
|
|
|
||
|
|
The plan is usually broken into fragments or stages that can run in parallel.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## What distribution adds
|
||
|
|
|
||
|
|
### Partitioning
|
||
|
|
|
||
|
|
Data is split across machines, often by file boundaries, shards, or key ranges.
|
||
|
|
|
||
|
|
### Exchange / shuffle
|
||
|
|
|
||
|
|
Rows or batches are moved across the network so that later operators see the right grouping or join partition.
|
||
|
|
|
||
|
|
### Scheduling
|
||
|
|
|
||
|
|
The system decides where tasks run and when.
|
||
|
|
|
||
|
|
### Fault tolerance
|
||
|
|
|
||
|
|
The engine needs some strategy for retries, recomputation, or checkpointing.
|
||
|
|
|
||
|
|
### Coordination overhead
|
||
|
|
|
||
|
|
Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Why distributed execution is hard
|
||
|
|
|
||
|
|
The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.
|
||
|
|
|
||
|
|
For example:
|
||
|
|
|
||
|
|
- a join may require shuffling huge datasets
|
||
|
|
- a group-by may need global repartitioning by key
|
||
|
|
- skewed keys may overload one worker
|
||
|
|
|
||
|
|
So distributed optimization is partly about minimizing data movement, not just local operator cost.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Common patterns
|
||
|
|
|
||
|
|
### Map-then-reduce style
|
||
|
|
|
||
|
|
Local work happens close to the data, then partial results are shuffled and merged.
|
||
|
|
|
||
|
|
### Stage DAGs
|
||
|
|
|
||
|
|
Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.
|
||
|
|
|
||
|
|
### Partial then final aggregation
|
||
|
|
|
||
|
|
Workers compute local aggregates, then a later stage merges them.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Practical mental model
|
||
|
|
|
||
|
|
Single-node engines mostly optimize CPU, memory, and local I/O.
|
||
|
|
|
||
|
|
Distributed engines must also optimize:
|
||
|
|
|
||
|
|
- network traffic
|
||
|
|
- partition balance
|
||
|
|
- task scheduling
|
||
|
|
- failure recovery
|
||
|
|
|
||
|
|
That is why distributed query processing is a second architecture layer rather than a small extension.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Changelog
|
||
|
|
|
||
|
|
* **April 1, 2026** -- First version created.
|