Add note files about joins, aggregations, and distributed engines

2026-04-01 10:35:42 +02:00 · 2026-04-01 10:35:42 +02:00 · 9c6f689447
commit 9c6f689447
parent 8ed8347380
2 changed files with 253 additions and 0 deletions
--- a/hqew/008-joins-and-aggregations.md
+++ b/hqew/008-joins-and-aggregations.md
@@ -0,0 +1,146 @@
 # Joins and Aggregations
 A reference for two of the most important and expensive families of query operators.
 ---
 ## Short answer
 Joins combine related data from multiple inputs.
 Aggregations collapse many rows into summaries.
 These operators matter because they often dominate runtime, memory use, and optimization effort.
 ---
 ## Joins
 ### What a join does
 A join matches rows from two inputs according to some condition.
 Common cases include:
 - equality joins on keys
 - outer joins preserving unmatched rows
 - semi-joins for existence checks
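The three cases above can be contrasted on a tiny input. This is only an illustrative sketch: the `left` and `right` tables and their contents are made up for the example, and the join condition is assumed to be equality on the first field.

```python
# Hypothetical inputs: (customer_id, name) and (customer_id, item).
left = [(1, "ann"), (2, "bob"), (3, "cy")]
right = [(1, "book"), (1, "lamp"), (2, "pen")]

right_keys = {key for key, _ in right}

# Equality (inner) join: one output row per matching pair.
inner = [(l, r) for l in left for r in right if l[0] == r[0]]

# Left outer join: inner matches plus unmatched left rows padded with None.
left_outer = inner + [(l, None) for l in left if l[0] not in right_keys]

# Semi-join: each left row at most once, only if some match exists.
semi = [l for l in left if l[0] in right_keys]
```

Note that the semi-join returns `ann` once even though she has two matching rows on the right, while the inner join returns her twice.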
 ### Common join algorithms
 #### Nested-loop join
 For each row of one input (the outer side), scan the other input (the inner side) and emit every pair that satisfies the join condition.
 Strength:
 - simple
 Weakness:
 - work grows with the product of the two input sizes, so it is usually expensive on large inputs
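A minimal sketch of the idea, assuming rows are plain tuples and the join condition is passed in as a function. The names `nested_loop_join`, `employees`, and `departments` are illustrative, not taken from any particular engine.

```python
def nested_loop_join(outer, inner, predicate):
    """Return (outer_row, inner_row) pairs for which predicate is true."""
    result = []
    for o in outer:        # one pass over the outer input
        for i in inner:    # full scan of the inner input per outer row
            if predicate(o, i):
                result.append((o, i))
    return result


employees = [("ann", 10), ("bob", 20), ("cy", 30)]
departments = [(10, "eng"), (20, "ops")]
pairs = nested_loop_join(employees, departments, lambda e, d: e[1] == d[0])
```

Because the predicate is an arbitrary function, this shape also handles non-equality conditions, which is one reason the algorithm stays useful despite its cost.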
 #### Hash join
 Build a hash table on one side, then probe it with the other side.
 Strength:
 - often very good for equality joins
 Weakness:
 - needs memory for the build side
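The build and probe phases can be sketched as follows. This assumes the build side fits in memory and uses a plain dictionary as the hash table; the function and parameter names are hypothetical.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Equality join: hash the build side, then stream the probe side."""
    # Build phase: group build rows by join key.
    table = defaultdict(list)
    for b in build:
        table[build_key(b)].append(b)

    # Probe phase: look up each probe row's key in the hash table.
    result = []
    for p in probe:
        for b in table.get(probe_key(p), []):
            result.append((b, p))
    return result


departments = [(10, "eng"), (20, "ops")]
employees = [("ann", 10), ("bob", 20), ("cy", 30), ("dee", 10)]
joined = hash_join(departments, employees, lambda d: d[0], lambda e: e[1])
```

Here the smaller input (`departments`) is chosen as the build side, which keeps the in-memory table small.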
 #### Sort-merge join
 Sort both sides by join key, then merge them.
 Strength:
 - useful when inputs are already sorted or ordering is needed
 Weakness:
 - sorting can be expensive
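A sketch of the sort and merge phases, assuming an equality join on comparable keys. It sorts both inputs itself for simplicity; an engine would skip that step when the inputs already arrive in key order. All names are illustrative.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Equality join by sorting both sides and merging matching key runs."""
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left_key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    result.append((l, r))
            i, j = i_end, j_end
    return result


left_rows = [(2, "b"), (1, "a"), (2, "c")]
right_rows = [(2, "x"), (3, "y"), (2, "z")]
merged = sort_merge_join(left_rows, right_rows, lambda r: r[0], lambda r: r[0])
```

The output comes out ordered by join key, which is the property that makes this algorithm attractive when a later operator needs that ordering.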
 ---
 ## Why join order matters
 If a query has several joins, the engine may have many legal join orders.
 Different orders can create radically different intermediate sizes, which is why join planning is one of the hardest and most important optimizer tasks.
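To make the effect concrete, here is a small estimate over three hypothetical tables. The table sizes and predicate selectivities are invented for illustration, and the estimate uses the simple textbook model (output rows = product of input sizes times selectivity) for left-deep plans only. A pair of tables with no predicate between them is treated as a cross product.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical table sizes (row counts).
sizes = {"orders": 1_000_000, "customers": 10_000, "regions": 10}

# Hypothetical selectivities for the join predicates that exist.
selectivity = {
    frozenset(("orders", "customers")): Fraction(1, 10_000),
    frozenset(("customers", "regions")): Fraction(1, 10),
}


def intermediate_sizes(order):
    """Estimated row count after each join in a left-deep plan."""
    joined = [order[0]]
    rows = Fraction(sizes[order[0]])
    out = []
    for table in order[1:]:
        sel = Fraction(1)
        for prior in joined:
            sel *= selectivity.get(frozenset((prior, table)), Fraction(1))
        rows = rows * sizes[table] * sel
        joined.append(table)
        out.append(int(rows))
    return out


plans = {order: intermediate_sizes(order) for order in permutations(sizes)}
```

Every order ends at the same estimated final size, but the first intermediate result ranges from 10,000 rows (`customers`, `regions`, `orders`) to 10,000,000 rows (`orders`, `regions`, `customers`, which starts with a cross product).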
 ---
 ## Aggregations
 ### What an aggregation does
 An aggregation computes summary values such as:
 - `COUNT`
 - `SUM`
 - `AVG`
 - `MIN`
 - `MAX`
 It may do this:
 - globally over all rows
 - per group, such as `GROUP BY department`
 ### Common aggregation strategies
 #### Streaming aggregation
 Works well when the input is already grouped or sorted on the grouping columns, because only the current group's state needs to be held at a time.
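A minimal sketch using `itertools.groupby` from the Python standard library, assuming the rows arrive already sorted by the grouping key. The function name and sample data are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def streaming_sum(sorted_rows):
    """Yield (key, total) for rows that are already sorted by key."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        # Only the running total for the current group is kept.
        yield key, sum(value for _, value in group)


rows = [("eng", 5), ("eng", 7), ("ops", 1), ("sales", 2), ("sales", 3)]
totals = list(streaming_sum(rows))
```

If the input were not grouped by key, the same key would show up in several separate output rows, which is why the ordering precondition matters.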
 #### Hash aggregation
 Uses a hash table keyed by grouping columns.
 Strength:
 - common and flexible
 Weakness:
 - memory pressure for many groups
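A sketch of the same kind of aggregate over unsorted input, with a dictionary standing in for the hash table. The per-group state here is a hypothetical `(count, total)` pair.

```python
def hash_aggregate(rows):
    """Return {key: (count, total)} for rows in any order."""
    state = {}
    for key, value in rows:
        count, total = state.get(key, (0, 0))
        state[key] = (count + 1, total + value)
    return state


rows = [("ops", 1), ("eng", 5), ("sales", 2), ("eng", 7), ("sales", 3)]
groups = hash_aggregate(rows)
```

The dictionary holds one entry per distinct group until the input is exhausted, which is where the memory pressure for high-cardinality grouping keys comes from.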
 #### Partial aggregation
 Aggregate locally first, then merge partial results later.
 This is especially important in distributed systems.
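For an aggregate like `AVG`, the partial state has to carry enough information to be merged correctly. A minimal sketch, assuming each chunk represents the rows seen by one local worker and that the partial state is a `(sum, count)` pair:

```python
def partial_avg(values):
    """Local step: reduce a chunk to a mergeable (sum, count) state."""
    return (sum(values), len(values))


def merge_avg(partials):
    """Final step: combine partial states into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


chunks = [[1, 2, 3], [10], [4, 6]]
partials = [partial_avg(chunk) for chunk in chunks]
average = merge_avg(partials)

# Averaging the per-chunk averages gives a different, incorrect answer.
naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)
```

The comparison with `naive` shows why the local step ships sums and counts rather than finished averages.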
 ---
 ## Why aggregations are tricky
 Aggregations are conceptually simple but operationally important because they can:
 - change row cardinality dramatically
 - create blocking behavior
 - require state per group
 - interact with nulls and types in ways that need careful handling
 So they are simple algebraically but serious at runtime.
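The null handling point can be illustrated with standard SQL semantics, where `COUNT(*)` counts rows but `COUNT(col)`, `SUM`, and `AVG` skip nulls, and `SUM` over no non-null values yields null. The sketch below models SQL `NULL` as Python `None`; the helper names are made up for the example.

```python
def sql_count(values):
    """COUNT(col): number of non-null values."""
    return sum(1 for v in values if v is not None)


def sql_sum(values):
    """SUM(col): null when there are no non-null values."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def sql_avg(values):
    """AVG(col): sum of non-null values over their count."""
    count = sql_count(values)
    return sql_sum(values) / count if count else None


salaries = [100, None, 300]
count_star = len(salaries)        # COUNT(*) counts every row
count_col = sql_count(salaries)   # COUNT(salary) skips the null
average = sql_avg(salaries)
```

Dividing the sum by `count_star` instead of `count_col` would silently give a lower average, which is the kind of subtle error this item refers to.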
 ---
 ## Practical mental model
 Joins expand and combine structure.
 Aggregations compress and summarize structure.
 Those two directions explain why they sit at the center of so much query-engine design.
 ---
 ## Changelog
 * **April 1, 2026** -- First version created.
--- a/hqew/009-distributed-query-engines.md
+++ b/hqew/009-distributed-query-engines.md
@@ -0,0 +1,107 @@
 # Distributed Query Engines
 A reference for what changes when query execution moves from one machine to many.
 ---
 ## Short answer
 A distributed query engine is not just a single-node engine with remote workers.
 Once execution is spread across machines, the engine needs extra architecture for:
 - partitioning data
 - moving data between stages
 - scheduling tasks
 - handling failures
 Those concerns become first-order design problems.
 ---
 ## The basic shape
 Most distributed engines have:
 - a coordinator or planner
 - workers or executors
 - partitioned input data
 - exchange or shuffle steps
 - a final result collection stage
 The plan is usually broken into fragments or stages that can run in parallel.
 ---
 ## What distribution adds
 ### Partitioning
 Data is split across machines, often by file boundaries, shards, or key ranges.
 ### Exchange / shuffle
 Rows or batches are moved across the network so that later operators see the right grouping or join partition.
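A sketch of hash repartitioning for a join, assuming both inputs are shuffled on the join key with the same function and partition count. `zlib.crc32` is used only because it gives stable results across processes; the function names and sample rows are hypothetical.

```python
import zlib


def partition_of(key, num_partitions):
    """Map a key to a partition id in a way that is stable across processes."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def shuffle(rows, key_fn, num_partitions):
    """Route each row to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_of(key_fn(row), num_partitions)].append(row)
    return partitions


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "ann"), (2, "bob"), (3, "cy")]

order_parts = shuffle(orders, lambda r: r[0], 2)
customer_parts = shuffle(customers, lambda r: r[0], 2)
```

Because both sides use the same partitioning function, partition `p` of `orders` only needs to be joined with partition `p` of `customers`, so each worker can run a local join independently.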
 ### Scheduling
 The system decides where tasks run and when.
 ### Fault tolerance
 The engine needs some strategy for retries, recomputation, or checkpointing.
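The simplest of these strategies, retrying a failed task, can be sketched as below. This is a toy model under loud assumptions: a task is a plain function that receives the attempt number, failure is signalled by `RuntimeError`, and there is no backoff, rescheduling, or checkpointing. The names `run_with_retries` and `flaky_scan` are invented for the example.

```python
def run_with_retries(task, max_attempts=3):
    """Re-run a task until it succeeds or the attempt budget is spent."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError as err:
            last_error = err  # remember the failure and try again
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_scan(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise RuntimeError("worker lost")
    return [1, 2, 3]


rows = run_with_retries(flaky_scan)
```

Retrying is only safe when the task can be recomputed from its inputs without side effects, which is one reason engines tend to keep stage inputs available until a stage finishes.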
 ### Coordination overhead
 Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.
 ---
 ## Why distributed execution is hard
 The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.
 For example:
 - a join may require shuffling huge datasets
 - a group-by may need global repartitioning by key
 - skewed keys may overload one worker
 So distributed optimization is partly about minimizing data movement, not just local operator cost.
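The skew case can be made concrete by counting how many rows each partition receives. The sketch assumes integer keys partitioned by simple modulo, and a made-up workload where one key accounts for 90 of 100 rows.

```python
from collections import Counter


def partition_load(keys, num_partitions):
    """Count rows per partition under modulo partitioning of integer keys."""
    load = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        load[key % num_partitions] += 1
    return load


# Hypothetical workload: key 7 is "hot", the rest are spread evenly.
keys = [7] * 90 + list(range(10))
load = partition_load(keys, 4)

mean_load = len(keys) / 4
skew_ratio = max(load.values()) / mean_load
```

With these numbers one partition receives 92 of the 100 rows, so adding more workers would barely shorten the stage: the slowest partition sets the runtime.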
 ---
 ## Common patterns
 ### Map-then-reduce style
 Local work happens close to the data, then partial results are shuffled and merged.
 ### Stage DAGs
 Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.
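A small sketch of a stage DAG and the order in which its stages become runnable, using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The stage names and dependencies are hypothetical; each stage maps to the set of stages it must wait for.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: stage -> stages whose output it consumes.
stages = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "partial_agg": {"join"},
    "final_agg": {"partial_agg"},
}

sorter = TopologicalSorter(stages)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # stages with all inputs finished
    waves.append(ready)                 # these could run in parallel
    sorter.done(*ready)
```

Each entry in `waves` is a set of stages with no unmet dependencies: the two scans run together, and everything downstream waits at an exchange boundary for its inputs.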
 ### Partial then final aggregation
 Workers compute local aggregates, then a later stage merges them.
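One way to see the benefit is to compare how many records cross the network with and without the local step. The sketch below assumes a grouped `COUNT` over two hypothetical workers and uses `collections.Counter` as the partial state.

```python
from collections import Counter

# Hypothetical rows held by two workers (only the grouping key is shown).
worker_rows = [
    ["eng"] * 500 + ["ops"] * 300,
    ["eng"] * 200 + ["sales"] * 400,
]

# Partial stage: each worker reduces its rows to one count per key.
partials = [Counter(rows) for rows in worker_rows]

# Final stage: merge the partial counts.
final = Counter()
for partial in partials:
    final.update(partial)

records_without_partial = sum(len(rows) for rows in worker_rows)
records_with_partial = sum(len(partial) for partial in partials)
```

In this example 1,400 raw rows shrink to 4 partial records before the merge. The saving depends on the number of distinct keys per worker, so it is much smaller when almost every row has a unique key.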
 ---
 ## Practical mental model
 Single-node engines mostly optimize CPU, memory, and local I/O.
 Distributed engines must also optimize:
 - network traffic
 - partition balance
 - task scheduling
 - failure recovery
 That is why distributed query processing is a second architecture layer rather than a small extension.
 ---
 ## Changelog
 * **April 1, 2026** -- First version created.