
# Joins and Aggregations

A reference for two of the most important and expensive families of query operators.

---

## Short answer

Joins combine related data from multiple inputs.

Aggregations collapse many rows into summaries.

These operators matter because they often dominate runtime, memory use, and optimization effort.

---

## Joins

### What a join does

A join matches rows from two inputs according to some condition.

Common cases include:

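- equality joins (equi-joins), which match rows whose key columns are equal
- outer joins, which also preserve unmatched rows from one or both sides
- semi-joins, which return a row from one side if any match exists on the other (an existence check)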
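As a rough illustration of how these three cases differ, here is a minimal Python sketch. The function names and the `customers`/`orders` sample rows are made up for this note, and the quadratic scans are chosen for clarity, not speed.

```python
def inner_join(left, right, key):
    """Emit one pair per matching (left, right) combination."""
    return [(lrow, rrow) for lrow in left for rrow in right
            if lrow[key] == rrow[key]]


def left_outer_join(left, right, key):
    """Like inner_join, but an unmatched left row is kept, paired with None."""
    out = []
    for lrow in left:
        matches = [rrow for rrow in right if lrow[key] == rrow[key]]
        if matches:
            out.extend((lrow, rrow) for rrow in matches)
        else:
            out.append((lrow, None))
    return out


def semi_join(left, right, key):
    """Keep each left row at most once if any right row matches."""
    right_keys = {rrow[key] for rrow in right}
    return [lrow for lrow in left if lrow[key] in right_keys]


# Hypothetical sample data.
customers = [{"cust": 10}, {"cust": 20}, {"cust": 30}]
orders = [
    {"cust": 10, "order": 1},
    {"cust": 10, "order": 2},
    {"cust": 20, "order": 3},
]
```

With this data the inner join yields three pairs, the left outer join adds a fourth pair for customer 30 with `None` on the right, and the semi-join returns customers 10 and 20 once each, even though customer 10 has two orders.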

### Common join algorithms

#### Nested-loop join

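For each row on one side (the outer input), scan the other side (the inner input) and emit every pair that satisfies the join condition.

Strength:

- simple, and works for any join condition, not only equality

Weakness:

- usually expensive on large inputs, because the number of comparisons grows with the product of the two input sizes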
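A minimal sketch of the idea in Python, assuming both inputs are in-memory lists. The name `nested_loop_join` and the `predicate` parameter are illustrative, not taken from any particular engine.

```python
def nested_loop_join(outer, inner, predicate):
    """Yield every (outer_row, inner_row) pair for which predicate holds.

    `inner` must be re-iterable (for example a list), since it is scanned
    once per outer row: len(outer) * len(inner) predicate calls in total.
    """
    for outer_row in outer:
        for inner_row in inner:
            if predicate(outer_row, inner_row):
                yield (outer_row, inner_row)


# The condition does not have to be an equality test.
pairs = list(nested_loop_join([1, 5, 9], [4, 6], lambda a, b: a < b))
# pairs == [(1, 4), (1, 6), (5, 6)]
```

Because the predicate is an arbitrary function, the same loop handles range or inequality conditions that hash-based approaches cannot use directly.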

#### Hash join

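Build a hash table on the join key of one side (ideally the smaller one), then probe it with each row from the other side.

Strength:

- often very good for equality joins, since each probe is an expected constant-time lookup

Weakness:

- needs memory for the build side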
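The build and probe phases can be sketched as follows. This is a simplified in-memory version; the names `hash_join`, `build_key`, and `probe_key` are assumptions made for this example, and spilling to disk is not modeled.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Equality join: hash the build side, then stream the probe side."""
    # Build phase: the whole build side is held in memory.
    table = defaultdict(list)
    for build_row in build:
        table[build_key(build_row)].append(build_row)

    # Probe phase: one dictionary lookup per probe row.
    for probe_row in probe:
        for build_row in table.get(probe_key(probe_row), ()):
            yield (build_row, probe_row)


# Hypothetical sample data: (dept_id, name) and (employee, dept_id).
departments = [(1, "eng"), (2, "ops")]
employees = [("ann", 1), ("bob", 2), ("cy", 1), ("dee", 3)]

joined = list(hash_join(departments, employees,
                        build_key=lambda d: d[0],
                        probe_key=lambda e: e[1]))
```

Here `joined` contains three pairs in probe order; `("dee", 3)` finds no entry in the table and is dropped, as expected for an inner join.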

#### Sort-merge join

Sort both sides by join key, then merge them.

Strength:

- useful when inputs are already sorted on the join key, or when the output must be ordered by that key anyway

Weakness:

- sorting can be expensive
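A sketch of the sort and merge steps in Python, assuming an equality join on a single key extracted by a hypothetical `key` function. Runs of duplicate keys on both sides are paired with each other, which is the part that is easy to get wrong.

```python
def sort_merge_join(left, right, key):
    """Equality join by sorting both inputs and merging them in one pass."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Emit every combination within the two runs.
            for left_row in left[i:i_end]:
                for right_row in right[j:j_end]:
                    out.append((left_row, right_row))
            i, j = i_end, j_end
    return out


rows = sort_merge_join(
    [(2, "b"), (1, "a"), (2, "c")],
    [(2, "x"), (3, "y"), (2, "z")],
    key=lambda t: t[0],
)
```

If the inputs already arrive sorted on the join key, the two `sorted` calls are the part an engine could skip, leaving only the linear merge.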

---

## Why join order matters

If a query has several joins, the engine may have many legal join orders.

Different orders can produce intermediate results of radically different sizes, even though the final result is the same. This is why join planning is one of the hardest and most important optimizer tasks.
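To make the effect concrete, the sketch below estimates intermediate sizes for three left-deep orders over the same tables. The table sizes and selectivities are invented numbers, and the estimate (product of the input sizes times the selectivity of each applicable predicate) is a textbook simplification, not how any specific optimizer works.

```python
# Hypothetical row counts.
sizes = {"A": 1_000_000, "B": 1_000, "C": 10}

# Hypothetical predicate selectivities; A and C share no predicate,
# so joining them directly is a cross product (selectivity 1.0).
selectivity = {frozenset(("A", "B")): 0.001, frozenset(("B", "C")): 0.01}


def intermediate_rows(order):
    """Estimated result size after each join in a left-deep plan."""
    joined = [order[0]]
    rows = sizes[order[0]]
    steps = []
    for table in order[1:]:
        factor = 1.0
        for earlier in joined:
            factor *= selectivity.get(frozenset((earlier, table)), 1.0)
        rows = rows * sizes[table] * factor
        joined.append(table)
        steps.append(round(rows))
    return steps


plan_abc = intermediate_rows(["A", "B", "C"])  # [1000000, 100000]
plan_bca = intermediate_rows(["B", "C", "A"])  # [100, 100000]
plan_acb = intermediate_rows(["A", "C", "B"])  # [10000000, 100000]
```

All three plans end with the same estimated 100,000 rows, but the first intermediate result ranges from 100 rows to 10 million depending on the order.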

---

## Aggregations

### What an aggregation does

An aggregation computes summary values such as:

- `COUNT`
- `SUM`
- `AVG`
- `MIN`
- `MAX`

It may do this:

- globally over all rows
- per group, such as `GROUP BY department`

### Common aggregation strategies

#### Streaming aggregation

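Works well when the input is already sorted or clustered on the grouping columns. The operator keeps state for only one group at a time and emits a result whenever the grouping key changes.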
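A minimal sketch using Python's `itertools.groupby`, which groups consecutive rows with equal keys. The function name `streaming_sum` and the `(department, amount)` rows are assumptions made for this example.

```python
from itertools import groupby


def streaming_sum(rows, key, value):
    """Sum `value` per group, assuming rows with equal keys are adjacent."""
    for group_key, group in groupby(rows, key=key):
        # Only the current group's running total is held in memory.
        yield group_key, sum(value(row) for row in group)


sorted_rows = [("eng", 10), ("eng", 5), ("ops", 7)]
totals = list(streaming_sum(sorted_rows, key=lambda r: r[0],
                            value=lambda r: r[1]))
# totals == [("eng", 15), ("ops", 7)]

# If the input is not grouped, the same key is emitted more than once.
unsorted_rows = [("eng", 1), ("ops", 2), ("eng", 3)]
split = list(streaming_sum(unsorted_rows, key=lambda r: r[0],
                           value=lambda r: r[1]))
# split == [("eng", 1), ("ops", 2), ("eng", 3)]
```

The second call shows why the ordering requirement matters: without it the operator silently produces fragmented groups.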

#### Hash aggregation

Uses a hash table keyed by grouping columns.

Strength:

- common and flexible

Weakness:

- memory pressure when there are many distinct groups, since each group needs its own entry in the table
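A sketch of the idea with a plain Python dictionary as the hash table. The name `hash_aggregate` and the sample rows are illustrative; a real engine would also need a strategy for when the table outgrows memory.

```python
def hash_aggregate(rows, key, value):
    """Compute COUNT, SUM and AVG per group; input order does not matter."""
    state = {}  # grouping key -> [count, running sum]
    for row in rows:
        acc = state.setdefault(key(row), [0, 0])
        acc[0] += 1
        acc[1] += value(row)
    # Results are only complete once all input has been consumed.
    return {
        group: {"count": count, "sum": total, "avg": total / count}
        for group, (count, total) in state.items()
    }


rows = [("eng", 10), ("ops", 4), ("eng", 20)]
summary = hash_aggregate(rows, key=lambda r: r[0], value=lambda r: r[1])
# summary["eng"] == {"count": 2, "sum": 30, "avg": 15.0}
```

The dictionary holds one entry per distinct key, which is where the memory pressure comes from.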

#### Partial aggregation

Aggregate locally first, then merge partial results later.

This is especially important in distributed systems.
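The sketch below shows the two phases for `AVG`, assuming rows are `(key, value)` pairs split across hypothetical partitions. Each partial carries a sum and a count, not an average, because per-partition averages cannot be merged correctly.

```python
from collections import defaultdict


def partial_avg(rows):
    """Local phase: reduce one partition to key -> [sum, count]."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def merge_partials(partials):
    """Final phase: add up the partial states, then compute the average."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical partitions, e.g. one per worker.
partitions = [
    [("eng", 10), ("ops", 4)],
    [("eng", 20), ("eng", 30)],
]
result = merge_partials(partial_avg(p) for p in partitions)
# result == {"eng": 20.0, "ops": 4.0}
```

Averaging the two per-partition averages for `eng` (10 and 25) would give 17.5, not the correct 20.0, which is why the partial state keeps sum and count separately.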

---

## Why aggregations are tricky

Aggregations are conceptually simple but operationally demanding because they can:

- change row cardinality dramatically
- block the pipeline, since a group's result is final only after all of its input has been seen
- require state per group
- require careful handling of nulls and types

So they are simple algebraically but serious at runtime.
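As one example of the null handling involved, the sketch below mimics standard SQL semantics in Python, with `None` standing in for `NULL`. The helper names are made up for this note.

```python
def sql_count(values):
    """COUNT(x): counts only non-null values."""
    return sum(1 for v in values if v is not None)


def sql_sum(values):
    """SUM(x): ignores nulls; returns null if there are no non-null values."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def sql_avg(values):
    """AVG(x): the sum of the non-null values divided by their count."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None


values = [10, None, 20]
count_star = len(values)       # COUNT(*) counts every row: 3
count_x = sql_count(values)    # 2
total = sql_sum(values)        # 30
average = sql_avg(values)      # 15.0, not 10.0
```

Dividing the sum by `COUNT(*)` instead of `COUNT(x)` would give 10.0 here, a common source of subtly wrong averages.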

---

## Practical mental model

Joins expand and combine structure.

Aggregations compress and summarize structure.

Those two directions explain why they sit at the center of so much query-engine design.

---

## Changelog

* **April 1, 2026** -- First version created.