# Distributed Query Engines

A reference for what changes when query execution moves from one machine to many.

---

## Short answer

A distributed query engine is not just a single-node engine with remote workers.

Once execution is spread across machines, the engine needs extra architecture for:

- partitioning data
- moving data between stages
- scheduling tasks
- handling failures

Those concerns become first-order design problems.

---

## The basic shape

Most distributed engines have:

- a coordinator or planner
- workers or executors
- partitioned input data
- exchange or shuffle steps
- a final result collection stage

The plan is usually broken into fragments or stages that can run in parallel.

---

## What distribution adds

### Partitioning

Data is split across machines, often by file boundaries, shards, or key ranges.

### Exchange / shuffle

Rows or batches are moved across the network so that later operators see the right grouping or join partition.

### Scheduling

The system decides where tasks run and when.

### Fault tolerance

The engine needs some strategy for retries, recomputation, or checkpointing.

### Coordination overhead

Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.

---

## Why distributed execution is hard

The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.

For example:

- a join may require shuffling huge datasets
- a group-by may need global repartitioning by key
- skewed keys may overload one worker

So distributed optimization is partly about minimizing data movement, not just local operator cost.

---

## Common patterns

### Map-then-reduce style

Local work happens close to the data, then partial results are shuffled and merged.

### Stage DAGs

Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.

### Partial then final aggregation

Workers compute local aggregates, then a later stage merges them.

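
The partial-then-final pattern can be sketched in a few lines of Python. This is a minimal, in-process illustration of a grouped sum, not how any particular engine implements it: the function names (`partition_of`, `partial_aggregate`, `shuffle`, `final_aggregate`, `run_query`) and the sample data are hypothetical, workers are simulated as plain lists, and the "network" is just a list of buckets. The choice of `zlib.crc32` for routing is also an assumption, made only because it is deterministic across processes.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key: str, num_partitions: int) -> int:
    """Pick the partition that owns a key (hypothetical routing rule)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return crc32(key.encode("utf-8")) % num_partitions


def partial_aggregate(rows):
    """Stage 1: each worker sums values per key over its local rows only."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_partitions):
    """Exchange: send each (key, partial_sum) to the partition owning the key."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            buckets[partition_of(key, num_partitions)].append((key, value))
    return buckets


def final_aggregate(bucket):
    """Stage 2: merge the partial sums that arrived at one partition."""
    sums = defaultdict(int)
    for key, value in bucket:
        sums[key] += value
    return dict(sums)


def run_query(worker_inputs, num_partitions=2):
    """Run both stages and return the merged result plus the shuffled buckets."""
    partials = [partial_aggregate(rows) for rows in worker_inputs]
    buckets = shuffle(partials, num_partitions)
    result = {}
    for bucket in buckets:
        # Keys are disjoint across buckets, so update() never overwrites.
        result.update(final_aggregate(bucket))
    return result, buckets


# Three simulated workers, each holding a slice of the input (made-up data).
worker_inputs = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7), ("c", 8)],
]

result, buckets = run_query(worker_inputs)
raw_rows = sum(len(rows) for rows in worker_inputs)
shuffled_rows = sum(len(bucket) for bucket in buckets)
print(result, raw_rows, shuffled_rows)
```

In this toy input the sums come out as `a = 10`, `b = 6`, `c = 20`, and the exchange carries 6 partial rows instead of the 8 raw rows. The saving grows with the number of repeated keys per worker, which is the reason the local stage runs before the shuffle. The sketch ignores skew: if one key dominates, the bucket that owns it still receives a disproportionate share of the work.
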
---

## Practical mental model

Single-node engines mostly optimize CPU, memory, and local I/O.

Distributed engines must also optimize:

- network traffic
- partition balance
- task scheduling
- failure recovery

That is why distributed query processing is a second architecture layer rather than a small extension.

---

## Changelog

* **April 1, 2026** -- First version created.