# Joins and Aggregations

A reference for two of the most important and expensive families of query operators.

---

## Short answer

Joins combine related data from multiple inputs.

Aggregations collapse many rows into summaries.

These operators matter because they often dominate runtime, memory use, and optimization effort.

---

## Joins

### What a join does

A join matches rows from two inputs according to some condition.

Common cases include:

- equality joins on keys
- outer joins preserving unmatched rows
- semi-joins for existence checks
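
As a rough illustration of these three cases, the sketch below evaluates each one over two small in-memory lists. The `employees` and `departments` data and all function names are made up for this example, and the plain Python loops only show the semantics, not how an engine would execute them.

```python
# Hypothetical sample data: (employee, dept_id) and (dept_id, dept_name).
employees = [("ann", 1), ("bob", 2), ("cy", 9)]
departments = [(1, "eng"), (3, "ops")]


def inner_join(left, right):
    """Equality join: emit only pairs whose keys match."""
    return [(name, dname) for name, k in left for d, dname in right if k == d]


def left_outer_join(left, right):
    """Keep every left row; pad with None when nothing matches."""
    out = []
    for name, k in left:
        matches = [dname for d, dname in right if k == d]
        out.extend((name, m) for m in (matches or [None]))
    return out


def semi_join(left, right):
    """Existence check: keep each left row that has at least one match."""
    keys = {d for d, _ in right}
    return [(name, k) for name, k in left if k in keys]
```

Note that the semi-join returns only columns from the left input and never duplicates a left row, even if several right rows match.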

### Common join algorithms

#### Nested-loop join

For each row on one side, scan the other side and keep the pairs that satisfy the join condition.

Strength:

- simple, and works for any join condition

Weakness:

- usually expensive on large inputs, because the work grows with the product of the input sizes
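
A minimal sketch of the idea, assuming both inputs are small Python lists and the join condition is passed in as a function. The names `nested_loop_join`, `orders`, and `tiers` are illustrative only.

```python
def nested_loop_join(outer, inner, predicate):
    """Emit every (outer_row, inner_row) pair for which predicate holds."""
    result = []
    for o in outer:          # one pass over the outer input
        for i in inner:      # full scan of the inner input per outer row
            if predicate(o, i):
                result.append((o, i))
    return result


# The predicate can be any condition, here a range check rather than equality.
orders = [("o1", 5), ("o2", 50)]
tiers = [("small", 0, 10), ("large", 10, 100)]
pairs = nested_loop_join(orders, tiers, lambda o, t: t[1] <= o[1] < t[2])
```

The two nested loops make the cost visible: every outer row triggers a full scan of the inner input.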

#### Hash join

Build a hash table on one side, then probe it with the other side.

Strength:

- often very good for equality joins

Weakness:

- needs memory for the build side
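
The build and probe phases might look like the following in-memory sketch. It assumes the whole build side fits in memory and ignores spilling; `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: index one side (ideally the smaller one) by join key.
    table = defaultdict(list)
    for b in build_rows:
        table[build_key(b)].append(b)
    # Probe phase: one hash lookup per probe row.
    result = []
    for p in probe_rows:
        for b in table.get(probe_key(p), []):
            result.append((b, p))
    return result


departments = [(1, "eng"), (2, "ops")]
employees = [("ann", 1), ("bob", 2), ("cy", 1), ("dee", 7)]
joined = hash_join(departments, employees, lambda d: d[0], lambda e: e[1])
```

Each input is read once, which is why this tends to beat a nested loop on equality conditions, at the price of holding `table` in memory.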

#### Sort-merge join

Sort both sides by the join key, then merge them.

Strength:

- useful when inputs are already sorted on the join key or sorted output is needed anyway

Weakness:

- sorting can be expensive
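
One way to sketch the sort and merge steps for an equality join, assuming in-memory lists and comparable keys. The function name and sample rows are made up, and the handling of duplicate keys is one possible approach, not the only one.

```python
def sort_merge_join(left, right, key_l, key_r):
    left = sorted(left, key=key_l)     # could be skipped if already sorted
    right = sorted(right, key=key_r)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_l(left[i]), key_r(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key_r(right[j_end]) == kl:
                j_end += 1
            # Pair every left row with this key against the whole run.
            while i < len(left) and key_l(left[i]) == kl:
                for r in right[j:j_end]:
                    result.append((left[i], r))
                i += 1
            j = j_end
    return result


left_rows = [(2, "b"), (1, "a"), (2, "c")]
right_rows = [(2, "x"), (3, "y"), (2, "z")]
merged = sort_merge_join(left_rows, right_rows, lambda r: r[0], lambda r: r[0])
```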

---

## Why join order matters

If a query has several joins, the engine may have many legal join orders.

Different orders can create radically different intermediate sizes, which is why join planning is one of the hardest and most important optimizer tasks.
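
A small back-of-the-envelope calculation can make this concrete. The table names, row counts, and match rates below are invented for illustration, and the estimate is a deliberately naive one (cross product divided by a fixed match rate), not how any particular optimizer works.

```python
# Hypothetical row counts for three tables.
rows = {"orders": 1_000_000, "customers": 10_000, "vip_flags": 100}

# Hypothetical match rates: "1 in N" pairs of the cross product survive the
# join predicate. Pairs of tables with no predicate are not listed.
match_one_in = {
    frozenset(["orders", "customers"]): 10_000,
    frozenset(["customers", "vip_flags"]): 10_000,
}


def first_join_size(a, b):
    """Estimated rows after joining a and b (cross product if no predicate)."""
    one_in = match_one_in.get(frozenset([a, b]), 1)
    return rows[a] * rows[b] // one_in


# Compare the intermediate result produced by each possible first join.
for a, b in [
    ("customers", "vip_flags"),
    ("orders", "customers"),
    ("orders", "vip_flags"),
]:
    print(f"{a} JOIN {b}: ~{first_join_size(a, b):,} intermediate rows")
```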

---

## Aggregations

### What an aggregation does

An aggregation computes summary values such as:

- `COUNT`
- `SUM`
- `AVG`
- `MIN`
- `MAX`

It may do this:

- globally over all rows
- per group, such as `GROUP BY department`

### Common aggregation strategies

#### Streaming aggregation

Works well when the input is already grouped or sorted on the grouping columns, because only the current group's state has to be kept.
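
A minimal sketch using `itertools.groupby`, which likewise only groups adjacent rows and so relies on the input being ordered by key. The `streaming_sum` name and the sample rows are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def streaming_sum(sorted_rows):
    """Yield (key, total) per group; rows must arrive ordered by key."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        # Only the running total for the current group is held at a time.
        yield key, sum(value for _, value in group)


rows = [("eng", 10), ("eng", 5), ("ops", 7)]   # already sorted by department
totals = list(streaming_sum(rows))
```

If the input were not ordered, the same key would show up as several separate groups, which is why this strategy depends on an upstream sort or index order.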

#### Hash aggregation

Uses a hash table keyed by the grouping columns.

Strength:

- common and flexible

Weakness:

- memory pressure when there are many distinct groups
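
As a sketch, a Python dict can stand in for the hash table, holding one small state tuple per group. The function name and data are hypothetical, and spilling to disk is not modeled.

```python
def hash_aggregate(rows):
    """Compute (COUNT, SUM) per group; input order does not matter."""
    state = {}  # one entry per distinct group key
    for key, value in rows:
        count, total = state.get(key, (0, 0))
        state[key] = (count + 1, total + value)
    return state


rows = [("ops", 7), ("eng", 10), ("ops", 1), ("eng", 5)]
result = hash_aggregate(rows)
```

The size of `state` grows with the number of distinct keys, not the number of rows, which is where the memory pressure comes from.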

#### Partial aggregation

Aggregate locally first, then merge the partial results later.

This is especially important in distributed systems, where it reduces the data that has to move between nodes.
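
The sketch below shows the two steps for `AVG`, assuming each partition is a plain list of `(key, value)` rows. The function names and partitions are made up. It carries `(sum, count)` as the partial state, since averaging the local averages would give the wrong answer when partitions differ in size.

```python
def partial_avg(rows):
    """Local step: reduce one partition to {key: (sum, count)}."""
    partial = {}
    for key, value in rows:
        s, c = partial.get(key, (0, 0))
        partial[key] = (s + value, c + 1)
    return partial


def merge_partials(partials):
    """Final step: combine partial states, then finish AVG = sum / count."""
    merged = {}
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged.get(key, (0, 0))
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}


# Two hypothetical partitions, for example held on different workers.
partition_a = [("eng", 10), ("ops", 4)]
partition_b = [("eng", 20), ("eng", 30)]
averages = merge_partials([partial_avg(partition_a), partial_avg(partition_b)])
```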

---

## Why aggregations are tricky

Aggregations are conceptually simple but operationally important because they can:

- change row cardinality dramatically
- create blocking behavior
- require state per group
- interact subtly with nulls and types

So they are simple algebraically but demanding at runtime.
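
The null-handling point can be illustrated with a short sketch that mimics the usual SQL convention of skipping NULL inputs, with Python's `None` standing in for NULL. The function name is hypothetical, and a specific engine may differ in details.

```python
def sql_like_aggregates(values):
    """Mimic the common SQL rule that aggregates skip NULL (None) inputs."""
    present = [v for v in values if v is not None]
    return {
        "count_star": len(values),                   # COUNT(*) counts every row
        "count_col": len(present),                   # COUNT(col) skips NULLs
        "sum": sum(present) if present else None,    # SUM of no values is NULL
        "avg": sum(present) / len(present) if present else None,
    }


stats = sql_like_aggregates([10, None, 20])
empty = sql_like_aggregates([None])
```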

---

## Practical mental model

Joins expand and combine structure.

Aggregations compress and summarize structure.

Those two directions explain why they sit at the center of so much query-engine design.

---

## Changelog

* **April 1, 2026** -- First version created.