Add note files about joins, aggregations, and distributed engines

2026-04-01 10:35:42 +02:00
commit 9c6f689447
parent 8ed8347380
2 changed files with 253 additions and 0 deletions
--- a/hqew/008-joins-and-aggregations.md
+++ b/hqew/008-joins-and-aggregations.md
@@ -0,0 +1,146 @@
+# Joins and Aggregations
+
+A reference for two of the most important and expensive families of query operators.
+
+---
+
+## Short answer
+
+Joins combine related data from multiple inputs.
+
+Aggregations collapse many rows into summaries.
+
+These operators matter because they often dominate runtime, memory use, and optimization effort.
+
+---
+
+## Joins
+
+### What a join does
+
+A join matches rows from two inputs according to some condition.
+
+Common cases include:
+
+- equality joins on keys
+- outer joins preserving unmatched rows
+- semi-joins for existence checks
+
+### Common join algorithms
+
+#### Nested-loop join
+
+For each row on one side (the outer input), scan the other side (the inner input) and emit the pairs that satisfy the join condition.
+
+Strength:
+
+- simple
+
+Weakness:
+
+- cost grows with the product of the two input sizes, so it is usually expensive on large inputs
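A minimal sketch of a nested-loop join, assuming both inputs are in-memory lists of tuples. The function name `nested_loop_join`, the sample rows, and the predicate are illustrative and not taken from any particular engine.

```python
def nested_loop_join(outer, inner, predicate):
    """Return every (outer_row, inner_row) pair for which predicate holds."""
    matches = []
    for o in outer:        # one pass over the outer input
        for i in inner:    # full scan of the inner input for each outer row
            if predicate(o, i):
                matches.append((o, i))
    return matches


# Hypothetical sample data: (order_id, customer) and (customer, country).
orders = [(1, "alice"), (2, "bob"), (3, "alice")]
customers = [("alice", "DE"), ("carol", "FR")]

pairs = nested_loop_join(orders, customers, lambda o, c: o[1] == c[0])
```

Because the predicate is an arbitrary function, the same loop also covers non-equality conditions, which is one reason this algorithm stays useful despite its cost.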
+
+#### Hash join
+
+Build a hash table on one side, then probe it with the other side.
+
+Strength:
+
+- often very good for equality joins
+
+Weakness:
+
+- needs memory for the build side, which is why the smaller input is usually chosen as the build side
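The build and probe phases can be sketched as follows, assuming an in-memory equality join. `hash_join`, `build_key`, and `probe_key` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equality join: build a hash table on one input, probe with the other."""
    # Build phase: group build-side rows by their join key.
    table = defaultdict(list)
    for b in build_rows:
        table[build_key(b)].append(b)

    # Probe phase: look up each probe-side row's key in the table.
    out = []
    for p in probe_rows:
        for b in table.get(probe_key(p), ()):
            out.append((b, p))
    return out


# Hypothetical sample data; the smaller input is used as the build side.
customers = [("alice", "DE"), ("carol", "FR")]
orders = [(1, "alice"), (2, "bob"), (3, "alice")]

joined = hash_join(customers, orders, lambda c: c[0], lambda o: o[1])
```

The whole build side lives in `table`, which is the memory cost noted above; this sketch does not attempt spilling to disk.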
+
+#### Sort-merge join
+
+Sort both sides by join key, then merge them.
+
+Strength:
+
+- useful when inputs are already sorted or ordering is needed
+
+Weakness:
+
+- sorting can be expensive
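A minimal sketch of a sort-merge join for an equality condition, assuming in-memory lists and a single key function shared by both sides. The name `sort_merge_join` and the sample rows are illustrative. Runs of equal keys on both sides are paired with each other, so duplicates are handled.

```python
def sort_merge_join(left, right, key):
    """Equality join by sorting both inputs and merging them in key order."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Emit every combination within the two runs.
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append((l_row, r_row))
            i, j = i_end, j_end
    return out


left = [(2, "b"), (1, "a"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
result = sort_merge_join(left, right, key=lambda row: row[0])
```

If the inputs already arrive sorted on the join key, the two `sorted` calls can be dropped and only the merge loop remains.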
+
+---
+
+## Why join order matters
+
+If a query has several joins, the engine may have many legal join orders.
+
+Different orders can create radically different intermediate sizes, which is why join planning is one of the hardest and most important optimizer
+tasks.
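A rough worked example of the effect, using invented row counts and a simplified uniform estimate (left rows times right rows divided by the number of distinct key values). The table sizes, the shared customer key, and `estimated_join_rows` are assumptions made only for illustration; real optimizers use richer statistics.

```python
DISTINCT_CUSTOMERS = 10_000  # assumed number of distinct join-key values


def estimated_join_rows(left_rows, right_rows, distinct_keys=DISTINCT_CUSTOMERS):
    # Simplified uniform estimate for an equality join on a shared key.
    return left_rows * right_rows // distinct_keys


# Hypothetical table sizes.
orders, customers, vips = 1_000_000, 10_000, 100

# Plan A: (orders JOIN customers) JOIN vips
plan_a_intermediate = estimated_join_rows(orders, customers)
plan_a_final = estimated_join_rows(plan_a_intermediate, vips)

# Plan B: (customers JOIN vips) JOIN orders
plan_b_intermediate = estimated_join_rows(customers, vips)
plan_b_final = estimated_join_rows(plan_b_intermediate, orders)
```

Both plans estimate the same final size, but under these assumptions Plan A materializes an intermediate result of 1,000,000 rows while Plan B materializes only 100.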
+
+---
+
+## Aggregations
+
+### What an aggregation does
+
+An aggregation computes summary values such as:
+
+- `COUNT`
+- `SUM`
+- `AVG`
+- `MIN`
+- `MAX`
+
+It may do this:
+
+- globally over all rows
+- per group, such as `GROUP BY department`
+
+### Common aggregation strategies
+
+#### Streaming aggregation
+
+Works well when input is already grouped or sorted on the grouping columns, because the operator only has to hold state for one group at a time.
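A minimal sketch of streaming aggregation, assuming rows of `(key, value)` that already arrive sorted by key. It uses the standard library's `itertools.groupby`, which groups only adjacent equal keys; `streaming_sum` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def streaming_sum(sorted_rows):
    """Yield (key, total) for rows that are already sorted by key."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        # Only the current group's running total is held at any time.
        yield key, sum(value for _, value in group)


rows = [("eng", 10), ("eng", 5), ("ops", 7), ("sales", 1), ("sales", 2)]
totals = list(streaming_sum(rows))
```

If the input is not sorted on the key, the same key can show up in several separate groups, which is why this strategy depends on input order.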
+
+#### Hash aggregation
+
+Uses a hash table keyed by grouping columns.
+
+Strength:
+
+- common and flexible
+
+Weakness:
+
+- memory pressure for many groups
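A minimal sketch of hash aggregation, assuming `(department, salary)` rows in arbitrary order. The function name and the choice of `COUNT`, `SUM`, and `AVG` are illustrative.

```python
def hash_aggregate(rows):
    """Compute count, sum, and avg per group using a dict keyed by group."""
    state = {}  # one (count, total) entry per distinct group
    for dept, salary in rows:
        count, total = state.get(dept, (0, 0))
        state[dept] = (count + 1, total + salary)
    return {
        dept: {"count": c, "sum": t, "avg": t / c}
        for dept, (c, t) in state.items()
    }


rows = [("eng", 10), ("ops", 7), ("eng", 20), ("sales", 3)]
summary = hash_aggregate(rows)
```

The `state` dict grows with the number of distinct groups, which is the source of the memory pressure mentioned above.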
+
+#### Partial aggregation
+
+Aggregate locally first, then merge partial results later.
+
+This is especially important in distributed systems.
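A small sketch of partial aggregation for `AVG`, assuming each worker holds one chunk of values. The key point is that an average cannot be merged from per-worker averages alone, so each partial result carries a `(sum, count)` pair. The names `partial_avg` and `merge_partials` are hypothetical.

```python
def partial_avg(values):
    """Local step: summarize one chunk as (sum, count)."""
    return (sum(values), len(values))


def merge_partials(partials):
    """Final step: combine the (sum, count) pairs from every chunk."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


worker_inputs = [[1, 2, 3], [10], [4, 6]]
partials = [partial_avg(chunk) for chunk in worker_inputs]
global_avg = merge_partials(partials)

# Averaging the per-chunk averages gives a different, incorrect answer here.
naive_avg = sum(sum(c) / len(c) for c in worker_inputs) / len(worker_inputs)
```

`COUNT`, `SUM`, `MIN`, and `MAX` merge directly, whereas `AVG` needs this decomposition into a sum and a count.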
+
+---
+
+## Why aggregations are tricky
+
+Aggregations are conceptually simple but operationally important because they can:
+
+- change row cardinality dramatically
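+- create blocking behavior (a hash aggregation cannot emit any group until it has consumed all of its input)
+- require state per group
+- interact with nulls and types in ways that need care (for example, `COUNT(column)` skips nulls, and `SUM` over a narrow integer type can overflow)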
+
+So they are simple algebraically but serious at runtime.
+
+---
+
+## Practical mental model
+
+Joins expand and combine structure.
+
+Aggregations compress and summarize structure.
+
+Those two directions explain why they sit at the center of so much query-engine design.
+
+---
+
+## Changelog
+
+* **April 1, 2026** -- First version created.
--- a/hqew/009-distributed-query-engines.md
+++ b/hqew/009-distributed-query-engines.md
@@ -0,0 +1,107 @@
+# Distributed Query Engines
+
+A reference for what changes when query execution moves from one machine to many.
+
+---
+
+## Short answer
+
+A distributed query engine is not just a single-node engine with remote workers.
+
+Once execution is spread across machines, the engine needs extra architecture for:
+
+- partitioning data
+- moving data between stages
+- scheduling tasks
+- handling failures
+
+Those concerns become first-order design problems.
+
+---
+
+## The basic shape
+
+Most distributed engines have:
+
+- a coordinator or planner
+- workers or executors
+- partitioned input data
+- exchange or shuffle steps
+- a final result collection stage
+
+The plan is usually broken into fragments or stages that can run in parallel.
+
+---
+
+## What distribution adds
+
+### Partitioning
+
+Data is split across machines, often by file boundaries, shards, or key ranges.
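As one illustration of the key-range case, here is a sketch that assigns rows to partitions by comparing a numeric key against sorted boundary values. The function name `range_partition` and the boundaries are assumptions for this example.

```python
from bisect import bisect_right


def range_partition(rows, boundaries, key_index=0):
    """Split rows into len(boundaries) + 1 partitions by key range."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right finds the first boundary strictly greater than the key.
        parts[bisect_right(boundaries, row[key_index])].append(row)
    return parts


rows = [(5, "a"), (42, "b"), (17, "c"), (99, "d"), (20, "e")]
# Partition 0: key < 20, partition 1: 20 <= key < 60, partition 2: key >= 60
parts = range_partition(rows, boundaries=[20, 60])
```

Range partitioning keeps neighboring keys together, which can help range scans, but the boundaries need to reflect the real key distribution to keep partitions balanced.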
+
+### Exchange / shuffle
+
+Rows or batches are moved across the network so that later operators see the right grouping or join partition.
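A minimal in-process sketch of a hash shuffle, assuming each upstream worker has produced a list of `(key, value)` rows. No real network transfer happens here; the lists stand in for per-target buffers. `zlib.crc32` is used because Python's built-in `hash()` for strings varies between processes. The function names are hypothetical.

```python
import zlib


def target_partition(key, num_targets):
    """Map a string key to a target partition with a process-stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_targets


def shuffle(worker_outputs, num_targets):
    """Route every row to the target partition that owns its key."""
    targets = [[] for _ in range(num_targets)]
    for rows in worker_outputs:
        for key, value in rows:
            targets[target_partition(key, num_targets)].append((key, value))
    return targets


worker_outputs = [
    [("eng", 1), ("ops", 2)],
    [("eng", 3), ("sales", 4)],
    [("ops", 5)],
]
targets = shuffle(worker_outputs, num_targets=2)
```

After the shuffle, all rows for a given key sit in the same target, so a downstream group-by or join can work on each target independently.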
+
+### Scheduling
+
+The system decides where tasks run and when.
+
+### Fault tolerance
+
+The engine needs some strategy for retries, recomputation, or checkpointing.
+
+### Coordination overhead
+
+Network, serialization, and orchestration can easily dominate runtime if the plan is badly shaped.
+
+---
+
+## Why distributed execution is hard
+
+The main difficulty is that a plan that looks cheap algebraically may be expensive once network movement is included.
+
+For example:
+
+- a join may require shuffling huge datasets
+- a group-by may need global repartitioning by key
+- skewed keys may overload one worker
+
+So distributed optimization is partly about minimizing data movement, not just local operator cost.
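To make the skew case concrete, this sketch counts how many rows each worker would receive under hash partitioning when one key is far more frequent than the rest. The key names, row counts, and `rows_per_worker` helper are invented for illustration.

```python
import zlib
from collections import Counter


def rows_per_worker(keys, num_workers):
    """Count how many rows land on each worker under hash partitioning."""
    load = Counter({worker: 0 for worker in range(num_workers)})
    for key in keys:
        load[zlib.crc32(key.encode("utf-8")) % num_workers] += 1
    return load


# Hypothetical workload: one hot key accounts for 90% of the rows.
keys = ["hot_customer"] * 9_000 + [f"customer_{i}" for i in range(1_000)]
load = rows_per_worker(keys, num_workers=4)
busiest = max(load.values())
```

All rows with the hot key go to a single worker, so that worker handles at least 9,000 of the 10,000 rows no matter how many workers are added.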
+
+---
+
+## Common patterns
+
+### Map-then-reduce style
+
+Local work happens close to the data, then partial results are shuffled and merged.
+
+### Stage DAGs
+
+Execution is represented as a directed acyclic graph of stages separated by exchange boundaries.
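A small sketch of scheduling a stage DAG in dependency order, using `graphlib.TopologicalSorter` from the Python standard library (3.9 or later). The stage names and their dependencies are hypothetical; each wave contains stages whose inputs are all complete and which could therefore run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
stages = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "partial_agg": {"join"},
    "final_agg": {"partial_agg"},
}

sorter = TopologicalSorter(stages)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # stages with all inputs finished
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as completed
```

A real scheduler would launch each ready stage as tasks on workers and call `done` as each one finishes, rather than completing a full wave at once.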
+
+### Partial then final aggregation
+
+Workers compute local aggregates, then a later stage merges them.
+
+---
+
+## Practical mental model
+
+Single-node engines mostly optimize CPU, memory, and local I/O.
+
+Distributed engines must also optimize:
+
+- network traffic
+- partition balance
+- task scheduling
+- failure recovery
+
+That is why distributed query processing is a second architecture layer rather than a small extension.
+
+---
+
+## Changelog
+
+* **April 1, 2026** -- First version created.