useful-notes/hqew/009-distributed-query-engines.md

108 lines
2.4 KiB
Markdown
Raw Normal View History

# Distributed Query Engines
A reference for what changes when query execution moves from one machine to many.
---
## Short answer
A distributed query engine is not just a single-node engine with remote workers.
Once execution is spread across machines, the engine needs extra architecture for:
- partitioning data
- moving data between stages
- scheduling tasks
- handling failures
Those concerns become first-order design problems.
---
## The basic shape
Most distributed engines have:
- a coordinator or planner
- workers or executors
- partitioned input data
- exchange or shuffle steps
- a final result collection stage
The plan is usually broken into fragments or stages that can run in parallel.
---
## What distribution adds
### Partitioning
Data is split across machines, often by file boundaries, shards, or key ranges.
### Exchange / shuffle
Rows or batches are moved across the network so that later operators see the right grouping or join partition.
### Scheduling
The system decides where tasks run and when.
### Fault tolerance
The engine needs some strategy for retries, recomputation, or checkpointing.
### Coordination overhead
Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.
---
## Why distributed execution is hard
The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.
For example:
- a join may require shuffling huge datasets
- a group-by may need global repartitioning by key
- skewed keys may overload one worker
So distributed optimization is partly about minimizing data movement, not just local operator cost.
---
## Common patterns
### Map-then-reduce style
Local work happens close to the data, then partial results are shuffled and merged.
### Stage DAGs
Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.
### Partial then final aggregation
Workers compute local aggregates, then a later stage merges them.
---
## Practical mental model
Single-node engines mostly optimize CPU, memory, and local I/O.
Distributed engines must also optimize:
- network traffic
- partition balance
- task scheduling
- failure recovery
That is why distributed query processing is a second architecture layer rather than a small extension.
---
## Changelog
* **April 1, 2026** -- First version created.